WordSmith Tools Manual


Version 6.0, by Mike Scott
© 2015 Mike Scott
Lexical Analysis Software Ltd.
Stroud, Gloucestershire, UK

WordSmith Tools Manual
© 2015 Mike Scott

All rights reserved. But most parts of this work may be reproduced in any form or by any means - graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - usually without the written permission of the publisher. See http://www.lexically.net/publications/copyright_permission_for_screenshots.htm

Products that are referred to in this document may be either trademarks and/or registered trademarks of the respective owners. The publisher and the author make no claim to these trademarks.

While every precaution has been taken in the preparation of this document, the publisher and the author assume no responsibility for errors or omissions, or for damages resulting from the use of information contained in this document or from the use of programs and source code that may accompany it. In no event shall the publisher and the author be liable for any loss of profit or any other commercial damage caused or alleged to have been caused directly or indirectly by this document.

Produced: December 2015
Publisher: Lexical Analysis Software Ltd.
Special thanks to: All the people who contributed to this document by testing WordSmith Tools in its various incarnations. Especially those who reported problems and sent me suggestions.

Table of Contents

Foreword

Part I WordSmith Tools

Part II Overview
1 Requirements
2 What's new in version 6
3 Controller
4 Concord
5 KeyWords
6 WordList
7 Utilities
    Character Profiler
    CharGrams
    Choose Languages
    Corpus Corruption Detector
    File Utilities
    Splitter
    File Viewer
    Minimal Pairs
    Text Converter
    Version Checker
    Viewer and Aligner
    Webgetter
    WSConcGram

Part III Getting Started
1 getting started with Concord
2 getting started with KeyWords
3 getting started with WordList

Part IV Installation and Updating
1 installing WordSmith Tools
2 what your licence allows
3 site licence defaults
4 version checking

Part V Controller
1 characters and letters
    accents and other characters
    wildcards
2 add notes
3 adjust settings
4 advanced settings
5 batch processing
6 choosing texts
    text formats
    the file-choose window
    favourite texts
7 choosing files from standard dialogue box
8 class or session instructions
9 colour categories
10 colours
11 column totals
12 compute new column of data
13 copy your results
14 count data frequencies
15 custom processing
16 editing
    reduce data to n entries
    reverse-delete
    delete or restore to end
    delete if
    editing column headings
    editing a list of data
17 find relevant files
18 folder settings
19 fonts
20 main settings
21 language
    Overview
    Language Chooser
    Other Languages
    Font
    Sort Order
    saving your choices
22 layout & format
23 match words in list
24 previous lists
25 print and print preview
26 quit WordSmith
27 saving
    save results
    save as text
28 scripting
29 searching
    search for word or part of word
    search by typing
    search & replace
30 selecting or marking entries
    selecting multiple entries
    marking entries
31 filenames tab
32 settings
    save defaults
    restoring settings
33 source text(s)
34 stop lists
35 suspend processing
36 text and languages
37 text dates and time-lines
38 window management
39 word clouds
40 zap unwanted lines

Part VI Tags and Markup
1 overview
2 choices in handling tags
3 custom settings
4 tags as selectors
5 only if containing...
6 part of file:selecting within texts
7 making a tag file
8 tag-types
9 start and end of text segments
10 multimedia tags
11 modify source texts
12 XML text

Part VII Concord
1 purpose
2 index
3 what is a concordance?
4 search-word or phrase
    search word syntax
    file-based search-words
    search-word and other settings
5 advice
6 blanking
7 Category or Set
    set column categories
    colour categories
8 clusters
9 Collocation
    what is collocation?
    collocate horizons
    collocation relationship
    collocates display
    collocates and lemmas
    collocate follow
    collocate highlighting in concordance
    collocate settings
    re-sorting: collocates
10 dispersion plot
11 concordancing on tags
    nearest tag
12 context word
13 editing concordances
    remove duplicates
14 patterns
15 re-sorting
    re-sorting: dispersion plot
16 saving and printing
17 sounds & video
    obtaining sound and video files
18 summary statistics
19 text segments in Concord
20 viewing options
21 WordSmith controller: Concord: settings

Part VIII KeyWords
1 purpose
2 index
3 ordinary two word-list analysis
4 choosing files
5 concordance
6 example of key words
7 keyness
    p value
    key-ness definition
    thinking about keyness
8 KeyWords database
    creating a database
    key key-word definition
    associates
        associate definition
    keywords database related clusters
    clumps
        regrouping clumps
9 KeyWords: advice
10 KeyWords: calculation
11 KeyWords clusters
12 KeyWords: links
13 make a word list from keywords data
14 plot display
    plot calculation
15 re-sorting: KeyWords
16 the key words screen
17 WordSmith controller: KeyWords settings

Part IX WordList
1 purpose
2 index
3 compare word lists
    compute key words
    comparing wordlists
    comparison display
4 merging wordlists
5 consistency
    consistency analysis (range)
    detailed consistency analysis
    re-sorting: consistency lists
    detailed consistency relations
6 find filenames
7 Lemmas (joining words)
    what are lemmas and how do we join words?
    manual joining
    auto-joining lemmas
    choosing lemma file
8 WordList Index
    what is an Index for?
    making a WordList Index
    index clusters
    join clusters
    index lists: viewing
    index exporting
9 menu search
10 relationships between words
    mutual information and other relations
    relationships display
    relationships computing
11 recompute tokens
12 statistics
    statistics
    type/token ratios
    summary statistics
13 stop-lists and match-lists
14 import words from text list
15 settings
    WordSmith controller: Index settings
    WordSmith controller: WordList settings
    minimum & maximum settings
    case sensitivity
16 sorting
17 WordList and tags
18 WordList display

Part X Utility Programs
1 Convert Data from Previous Versions
    Convert Data from Previous Versions
2 WebGetter
    overview
    settings
    display
    limitations
3 Corpus Corruption Detector
    Aim
    How it works
4 Minimal Pairs
    aim
    requirements
    choosing your files
    output
    rules and settings
    running the program
5 File Viewer
    Using File Viewer
6 File Utilities
    index
    Splitter
        Splitter: index
        aim of Splitter
        Splitter: filenames
        Splitter: wildcards
    join text files
    compare two files
    file chunker
    find duplicates
    rename
    move files to sub-folders
    dates and times
    find holes in texts
7 Text Converter
    purpose
    index
    settings
    Text Converter: copy to
    extracting from files
    filtering: move if
    Convert within the text file
        conversion file
        syntax
        sample conversion file
    Convert format of entire text files
        Mark-up changes
        Word, Excel, PDF
        non-Unicode Text
        Other changes
    Text Converter: converting BNC XML version
8 Viewer and Aligner
    purpose
    index
    aligning with Viewer & Aligner
    example of aligning
    aligning and moving
    editing
    languages
    numbering sentences & paragraphs
    options
    reading in a plain text
    joining and splitting
    settings
    technical aspects
    translation mis-matches
    troubleshooting
    unusual lines
9 WSConcGram
    aims
    definition of a concgram
    settings
    generating concgrams
    viewing concgrams
    filtering concgrams
    exporting concgrams
10 Character Profiler
    purpose
    profiling characters
    profiling settings
11 Chargrams
    purpose
    chargram procedure
    display
    settings

Part XI Reference
1 acknowledgements
2 API
3 bibliography
4 bugs
5 change language
6 Character Sets
    overview
    accents & symbols
7 clipboard
8 contact addresses
9 date format
10 Definitions
    definitions
    word separators
11 demonstration version
12 drag and drop
13 edit v. type-in mode
14 file extensions
15 finding source texts
16 flavours of Unicode
17 folders\directories
18 formulae
19 history list
20 HTML, SGML and XML
21 hyphens
22 international versions
23 limitations
    tool-specific limitations
24 links between tools
25 keyboard shortcuts
26 machine requirements
27 manual for WordSmith Tools
28 menu and button options
29 MS Word documents
30 never used WordSmith before
31 numbers
32 plot dispersion value
33 RAM availability
34 reference corpus
35 restore last file
36 single words v. clusters
37 speed
38 status bar
39 tools for pattern-spotting
40 version information
    Version 3 improvements
41 zip files

Part XII Troubleshooting
1 list of FAQs
2 apostrophes not found
3 column spacing
4 Concord tags problem
5 Concord/WordList mismatch
6 crashed
7 demo limit
8 funny symbols
9 illegible colours
10 keys don't respond
11 pineapple-slicing
12 printer didn't print
13 too slow
14 won't start
15 word list out of order

Part XIII Error Messages
1 list of error messages
2 .ini file not found
3 administrator rights
4 base list error
5 can only save words as ASCII
6 can't call other tool
7 can't make folder as that's an existing filename
8 can't compute key words as languages differ
9 can't merge list with itself!
10 can't read file
11 character set reset to to suit
12 concordance file is faulty
13 concordance stop list file not found
14 confirmation messages: okay to re-read
15 conversion file not found
16 destination folder not found
17 disk problem -- file not saved
18 dispersions go with concordances
19 drive not valid
20 failed to access Internet
21 failed to create new folder name
22 failed to read file
23 failed to save file
24 file access denied
25 file contains none of the tags specified
26 file has "holes"
27 file not found
28 filenames must differ!
29 folder is read-only
30 for use on X machine only
31 form incomplete
32 full drive & folder name needed
33 function not working properly yet
34 invalid concordance file
35 invalid file name
36 invalid KeyWords database file
37 invalid KeyWords calculation
38 invalid WordList comparison file
39 invalid WordList file
40 joining limit reached
41 KeyWords database file is faulty
42 KeyWords file is faulty
43 limit of file-based search-words reached
44 links between Tools disrupted
45 match list details not specified
46 must be a number
47 mutual information incompatible
48 network registration used elsewhere
49 no access to text file - in use elsewhere?
50 no associates found
51 no clumps identified
52 no clusters found
53 no collocates found
54 no concordance entries
55 no concordance stop list words
56 no deleted lines to zap
57 no entries in KeyWords database
58 no fonts available
59 no key words found
60 no key words to plot
61 no KeyWords stop list words
62 no lemma list words
63 no match list words
64 no room for computed variable
65 no statistics available
66 no stop list words
67 no such file(s) found
68 no tag list words
69 no word lists selected
70 not a valid number
71 not a WordSmith file
72 not a current WordSmith file
73 nothing activated
74 Only X% of words found in reference corpus
75 original text file needed but not found
76 printer needed
77 registration code in wrong format
78 registration is not correct
79 short of memory
80 source folder file(s) not found
81 stop list file not found
82 stop list file not read
83 tag file not found
84 tag file not read
85 this function is not yet ready
86 this is a demo version
87 this program needs Windows XP or greater
88 to stop getting this message ...
89 too many requests to ignore matching clumps
90 too many sentences
91 truncating at xx words -- tag list file has more
92 two files needed
93 unable to merge Keywords Databases
94 why did my search fail?
95 word list file is faulty
96 word list file not found
97 WordList comparison file is faulty
98 WordSmith Tools already running
99 WordSmith Tools expired
100 WordSmith version mis-match
101 XX days left

Index

Section I WordSmith Tools

1 WordSmith Tools

WordSmith Tools is an integrated suite of programs for looking at how words behave in texts. You will be able to use the tools to find out how words are used in your own texts, or those of others.

The WordList tool lets you see a list of all the words or word-clusters in a text, set out in alphabetical or frequency order. The concordancer, Concord, gives you a chance to see any word or phrase in context -- so that you can see what sort of company it keeps. With KeyWords you can find the key words in a text.

The tools have been used by Oxford University Press for their own lexicographic work in preparing dictionaries, by language teachers and students, and by researchers investigating language patterns in lots of different languages in many countries world-wide.

Getting Help

Online step-by-step screenshots show what WordSmith does.

Most of the menus and dialogue boxes have help options. You can often get help just by pressing F1, or by choosing Help (at the right hand side of most menus). Within a help file (like this one) you may find it easiest to click the Search button and examine the index offered, or else just browse through the help screens.

See also: getting started straight away with WordList, Concord, or KeyWords.

Version: 6.0 © 2015 Mike Scott. December, 2015.

Section II Overview

2 Overview

2.1 Requirements

WordSmith Tools requires
1. a reasonably up-to-date computer
2. running Windows XP or later
3. your own collection of text in plain text format or converted to plain text

2.2 What's new in version 6

WordSmith is organic software! Version 5.0 was started in June 2007, three years after version 4.0, and has continued this organic policy of growth ever since ... now in 2015 we are at version 6.0 with improvements and new features.

New features:
· Move files to sub-folders
· Skins
· Word Clouds
· Date handling & Time-lines
· .docx files
· Scripting
· Colour categories
· Phrase frames
· Collocate following
· Chargrams

2.3 Controller

This program controls the Tools. It is the one which shows and alters current defaults, handles the choosing of text files, and calls up the different Tools. It will appear at the top left corner of your screen. You can minimise it if you feel the screen is getting cluttered.

For a step-by-step view with screenshots, visit the WordSmith website.

2.4 Concord

Concord is a program which makes a concordance using plain text or web text files. To use it you will specify a search word, which Concord will seek in all the text files you have chosen. It will then present a concordance display, and give you access to information about collocates of the search word.
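To make the idea of a concordance concrete, here is a minimal Python sketch of a keyword-in-context listing. It is only an illustration of the general technique, not Concord's own code; the function name `concordance` and the `width` parameter are made up for this example.

```python
import re

def concordance(text, search_word, width=30):
    """Return one keyword-in-context line per whole-word match (illustrative only)."""
    # Collapse line breaks and runs of spaces so each context fits on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(search_word) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search word lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = "The cat sat on the mat. Later a second cat arrived. Concatenate is not a match."
for line in concordance(sample, "cat", width=20):
    print(line)
```

Each output line shows the search word with some context on either side, which is what lets you see "what sort of company it keeps".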

Listings can be saved for later use, edited, printed, copied to your word-processor, or saved as text files.

See also: Concord Help Contents Page, The buttons

2.5 KeyWords

The purpose of this program is to locate and identify key words in a given text. To do so, it compares the words in the text with a reference set of words usually taken from a large corpus of text. Any word which is found to be outstanding in its frequency in the text is considered "key". The key words are presented in order of outstandingness. The distribution of the key words can be plotted.

Listings can be saved for later use, edited, printed, copied to your word-processor, or saved as text files.

This program needs access to 2 or more word lists, which must be created first, using the WordList program.

See also: KeyWords Help Contents Page, The buttons

2.6 WordList

WordList generates word lists based on one or more plain text or web text files. Word lists are shown both in alphabetical and frequency order. They can be saved for later use, edited, printed, copied to your word-processor, or saved as text files.

See also: WordList Help Contents Page

2.7 Utilities

2.7.1 Character Profiler

A tool to help find out which characters or chargrams are most frequent in a text or a set of texts. The purpose could be to check out which characters or character sequences are most frequent (e.g. in normal English text the letter E followed by T will be most frequent, and THE and ARE will be high frequency 3-chargrams), or it could be to check whether your text collection contains any oddities, such as accented characters or curly apostrophes you weren't expecting.

See also: Character Profiling

2.7.2 CharGrams

A tool to help find out which chargrams (sequences of characters) are most frequent in a text or a set of texts. The purpose could be to check out which chargrams are most frequent e.g. in word-initial position, in the middle of a word, or at the end.

See also: Chargrams Tool

2.7.3 Choose Languages

A tool for selecting Languages which you want to process. You will probably only need to do this once, when you first use WordSmith Tools.

See also: Choose Language Tool

2.7.4 Corpus Corruption Detector

A tool to go through your corpus and seek out any text files which may have become corrupted. Works in any language.

See also: detecting corpus corruption

2.7.5 File Utilities

Utilities to
· compare two files
· cut large files into chunks
· find duplicate files
· rename multiple files
· split large files into their component texts
· join up a lot of small text files into merged text files
· find holes in your text files

Splitter

Splitter is a component of File Utilities which splits large files into small ones for text analysis purposes. You can specify a symbol to represent the end of a text and Splitter will go through a large file copying the text; each time it finds the symbol it will start a new text file.

See also: Splitter Help Contents Page

2.7.6 File Viewer

A tool for viewing how your text files are formatted in great detail, character by character.

See also: File Viewer Index

2.7.7 Minimal Pairs

A program to find typos and minimally-differing pairs of words.
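As a rough illustration of what "minimally-differing" can mean, the Python sketch below lists pairs of word forms that differ by a single substituted, inserted or deleted letter. This definition is an assumption made for the example; the rules the Minimal Pairs program actually applies are described in its own chapter, and the function names here are hypothetical.

```python
from itertools import combinations

def differs_minimally(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one letter substituted
    return a[i:] == b[i + 1:]          # one extra letter in b

def minimal_pairs(words):
    """Compare every pair of distinct word forms (fine for small word lists only)."""
    vocabulary = sorted(set(words))
    return [(a, b) for a, b in combinations(vocabulary, 2) if differs_minimally(a, b)]

print(minimal_pairs(["form", "from", "farm", "forms", "dog"]))
```

Note that under this assumed definition a transposition such as form/from counts as two changes, so it is not reported; a real typo-finder might well treat it differently.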

See also: aim, requirements, choosing your files, output, rules and settings, running the program.

2.7.8 Text Converter

Text Converter is a general-purpose utility which you use for four main tasks: to edit your texts, to rename text files, to change file attributes, and to move files into a new folder if they contain certain words or phrases.

The main use is to replace strings in text files. It does a "search and replace" much as in word-processors, but it can do this on lots of text files, one after the other. As it does so, it can also replace any number of strings, not just one.

It is very useful for going through large numbers of texts and re-formatting them as you prefer, e.g. taking out unnecessary spaces, ensuring only paragraphs have line breaks at their ends, changing accented characters.

See also: Text Converter Help Contents Page

2.7.9 Version Checker

A tool to check whether any components of your current version need updating and, if so, download them for you. Accessed via the main Controller menu, File | Web version check.

When you run the program you'll see the Version Checker window.

The various components of WordSmith are listed in the top window and the current version is compared with your present situation. If they are different, all the files in the relevant zip file will be starred (*) in the left margin. By default you will download to wherever WordSmith is already (the program in a program folder and settings etc. in a Documents folder) but you're free to choose somewhere else. Press Download if you wish to get the updated files.

After the download, the various .zip files are checked (bottom right window) if downloaded successfully, and the Install button is now available for use. Install unzips all those which are checked.

2.7.10 Viewer and Aligner

Viewer & Aligner is a utility which enables you to examine your files in various formats. It is called on by other Tools whenever you wish to see the source text.

Viewer & Aligner can also be used simply to produce a copy of a text file with numbered sentences or paragraphs or for aligning two or more versions of a text, showing alternate paragraphs or sentences of each.

See also: Viewer & Aligner Help Contents Page
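The two jobs mentioned above, numbering sentences and showing alternate sentences of two versions, can be sketched in a few lines of Python. This is a deliberately naive illustration: the sentence splitter, the `<n>` numbering format and the function names are assumptions for the example, not a description of how Viewer & Aligner works internally.

```python
import re
from itertools import zip_longest

def split_sentences(text):
    """Naive splitter: a sentence ends at . ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def number_sentences(text):
    """Return the sentences of a text, each prefixed with its number."""
    return [f"<{i}> {s}" for i, s in enumerate(split_sentences(text), start=1)]

def alternate(version_a, version_b):
    """Show sentence n of version A followed by sentence n of version B."""
    rows = []
    pairs = zip_longest(split_sentences(version_a), split_sentences(version_b), fillvalue="")
    for i, (a, b) in enumerate(pairs, start=1):
        rows.append(f"<{i}> A: {a}")
        rows.append(f"<{i}> B: {b}")
    return rows

for row in alternate("Good morning. How are you?", "Bom dia. Como vai?"):
    print(row)
```

Pairing sentences purely by position, as here, only works when both versions have the same number of sentences in the same order; handling mis-matches is the hard part of real alignment.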

2.7.11 Webgetter

A tool to gather text from the Internet.

The point of it...

The idea is to build up your own corpus of texts, by downloading web pages with the help of a search engine.

See also: A fuller overview, Settings, Display, Limitations

2.7.12 WSConcGram

A tool for generating concgrams.

See also: Aims of WSConcGram, Running WSConcGram

Section III Getting Started

3 Getting Started

3.1 getting started with Concord

For a step-by-step view with screenshots, visit the WordSmith website.

In the main WordSmith Tools window (the one with WordSmith Tools Controller in its title bar), choose the Tools option, and once that's opened up, you'll see the Concord button. Click and the Concord tool will start up.

Choose File | New

You should now see a dialogue box which lets you choose your texts or change your choice, and make a new concordance.

(If you only see the window with Concord in its caption, choose File | New and the Getting Started window will open up.)

If you have never used WordSmith before you will find a text has been selected for you automatically to help you get started.

You will need to specify a Search-Word or phrase and then press OK.

While Concord is working, you may see a progress indicator. In the manual's example there are 552 entries so far, and the last one in shows the context for worse, our search-word.

If you want to alter other settings, press Advanced, but you can probably leave the default settings as they are.

Concord now searches through your text(s) looking for the search word or Tag.

Don't forget to save the results (press Ctrl+F2) if you want to keep the concordance for another time.

See also: Concord Help Contents.

3.2 getting started with KeyWords

For a step-by-step view with screenshots, visit the WordSmith website.

In the main WordSmith Tools window (the one with WordSmith Tools Controller in its title bar), choose the Tools option, and once that's opened up, you'll see KeyWords. Click and KeyWords will open up.

Choose File | New

You see a dialogue box which lets you choose your word-lists.

You'll need to choose two word lists to make a key words list from: one based on a single text (or single corpus), and another one based on a corpus of texts, enough to make up a good reference corpus for comparison.

You will see two lists of the word list files in your current word-list folder. If there aren't any there, go back to the WordList tool and make some word lists. Choose one small word list above, and a reference corpus list below to compare it with.

With your texts selected, you're ready to do a key words analysis. Click on make a keyword list now.

You'll find that KeyWords starts processing your file and a progress window in the main Controller shows a bar indicating how it's getting on. After KeyWords has finished, it will show you a list of the key words. The ones at the top are more "key" than those further down.
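The comparison of a small word list with a reference word list can be sketched as follows. The example scores each word with a log-likelihood statistic, which is one commonly used measure of keyness; treating it as the measure here is an assumption for illustration, and the calculation KeyWords actually performs is documented in its own chapter. The function names and the toy word lists are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Log-likelihood score for one word, given its frequency in two word lists."""
    total = freq_text + freq_ref
    expected_text = size_text * total / (size_text + size_ref)
    expected_ref = size_ref * total / (size_text + size_ref)
    score = 0.0
    if freq_text > 0:
        score += freq_text * math.log(freq_text / expected_text)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

def key_words(text_counts, ref_counts, top=10):
    """Rank words that are unusually frequent in the text, most 'key' first."""
    size_text = sum(text_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq in text_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words proportionally more frequent in the text than in the reference.
        if freq / size_text > ref_freq / size_ref:
            scored.append((word, log_likelihood(freq, size_text, ref_freq, size_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

text_list = Counter("whale whale whale the the sea".split())
reference_list = Counter("the the the the the the sea a a of".split())
for word, score in key_words(text_list, reference_list):
    print(word, round(score, 2))
```

With real data the reference list would come from a corpus far larger than the text being studied, as the section above recommends.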

Don't forget to save the results (press Ctrl+F2) if you want to keep the keyword list for another time.

See also: KeyWords Help Contents, What's it for?

3.3 getting started with WordList

For a step-by-step view with screenshots, visit the WordSmith website.

I suggest you start by trying the WordList program. In the main WordSmith Tools window (the one with WordSmith Tools Controller in its title bar), choose the Tools option, and once that's opened up, you'll see WordList. Click and WordList will open up.

Choose File | New

You will see a dialogue box which lets you choose your texts or change your choice, and make a new word list.

If you have never used WordSmith before you will find a text has been selected for you automatically to help you get started.

There are other settings which can be altered via the menu, but usually you can just go straight ahead and make a new word list, individually or as a Batch.

You'll find that WordList starts processing your file(s) and a progress window in the main Controller shows a bar indicating how it's getting on. After WordList has finished making the list, you will see some windows showing the words from your text file in alphabetical order and in frequency order, statistics, filenames, notes.
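For readers who want to see what such a word list amounts to, here is a small Python sketch that counts word forms and presents them in alphabetical and in frequency order, with two basic statistics. The definition of a "word" used here (a run of letters, optionally with internal apostrophes, ignoring case) is an assumption for the example; WordList's own definitions and settings are covered later in this manual.

```python
import re
from collections import Counter

def make_word_list(text):
    """Count word forms, ignoring case; a 'word' is a run of letters with optional apostrophes."""
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(tokens)

counts = make_word_list("The cat sat on the mat. The mat didn't mind.")

alphabetical = sorted(counts.items())
by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

tokens = sum(counts.values())  # running words
types = len(counts)            # different word forms

print(alphabetical)
print(by_frequency)
print(tokens, types)
```

The same counts feed both displays; only the sort order changes.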

Don't forget to save the results (press Ctrl+F2) if you want to keep the word list for another time.

See also: WordList Help Contents.

Section IV Installation and Updating

4 Installation and Updating

4.1 installing WordSmith Tools

1. You have run or downloaded and then run one or more .exe files.
2. This will expand all the files needed for WordSmith Tools into the folder of your choice (\wsmith6 by default). You can install to a removable drive if you wish (explained below).
3. Now run \wsmith6\wordsmith6.exe to get started.

You will be asked to register. Otherwise WordSmith will go through its paces as a Demonstration Version.

Upon receipt of the registration code, run WordSmith Tools. If you have only just installed, the registration program will start up automatically. If not you can run \wsmith6\WSRegister6.exe .

Single User Licence

Everything must correspond exactly to what you were given when you purchased. Paste in your Name as specified in your purchase email or screen and (if there are any in the registration) Other Details, and paste in the code. This name appears in the main window and whenever you access the About option (F9). Your software will then be fully enabled, and the Update from Demo menu option will disappear. (The WSRegister6.exe program will still be there in your \wsmith6 folder, and can be used if you ever need to re-register.)

You may require Administrator rights to register an installation using \Program Files; search for "run as Administrator".

install to a removable drive

You don't need to install to the C:\ drive -- you can install WordSmith on a USB drive such as a pen drive or memory stick, or a fast external hard drive. That way you can take WordSmith with you from one computer to the next. A pen drive will be a rather slow medium, but a fast external drive

can be very satisfactory in terms of speed. If you save your default settings, any folder names which are on the external drive itself get the drive letter corrected automatically.

Site Licence

Follow the instructions at http://lexically.net/wordsmith/version6/faqs/network_installation.htm .

Updating your version

To update a version so as to get the very latest build of the program, just check the button in the Updates box. Or simply re-install afresh with a complete new download.

To update a demo version, visit http://lexically.net/wordsmith/purchasing.htm for details of suppliers.

If you make a mistake and your registration fails, you can try again. You can get a more recent version at the WordSmith home page.

To un-install, just delete all the files in your \wsmith6 folder. Your data may be in sub-folders of \wsmith6 or in sub-folders of your Documents\wsmith6.

See also: Setting default options, File types, Contact Addresses.

4.2 what your licence allows

In among the legal stuff you will find this, in relation to single user licences:

SINGLE USER LICENCES

Think of these as a licence for a person.

You can install the product on a machine at your office and a machine at home. You may yourself use both copies of the product, but only one at a time. You cannot install the product on two machines, and then use both of those copies at the same time, or allow anyone else to use your copy of the product on the second machine. For instance, you cannot purchase one copy of the product and allow a friend or family member to use the product on the other machine. You may not, at any time, allow another user to install your copy of the product for his/her own use.

SITE LICENCES
Think of these as a licence for a given number of terminals.

The full licence text is at \wsmith6\user_licence.txt. Your licence doesn't expire but it is limited to the major version you bought. So a WordSmith 5 licence won't work with WordSmith 6 and a WS 6 licence won't work with WS7, but you can re-download the program as needed from our site.

4.3 site licence defaults

If you have bought a site licence, just install one copy of WordSmith on any shared drive accessible by all your users. Follow the instructions at http://lexically.net/wordsmith/version6/faqs/network_installation.htm .

the wordsmith6.ini file
This file is in the folder where you installed: in it you will see a section which allows you to specify exactly where each user should save their preferences. The following terms are used:

prohibited drives
limited folder
instructions folder
network-read/write folder

and an example would be

[NETWORK]
network-read/write folder=m:\Documents\wsmith6
(drive M: is to be used when running on the network as it's one any user can write to.)
prohibited drives=xyz
(X: Y: and Z: are drives you don't want your users to look in when choosing texts.)
limited folder=v:\texts
(V:\TEXTS -- and any sub-directories of it -- is where users will by default choose their corpus on your network; though they may of course look elsewhere in any other drives they control.)
instructions folder=L:\English\WSmith instructions
(when you run the software in a teaching session, you will put the instructions in that folder.)

When a new user starts using WordSmith for the very first time, WordSmith will notice that it is running on a site-licence version and read the "network-read/write folder" information above. It will then try to automatically create the folder you have specified above (in theory you shouldn't need to do it yourself) and copy the various .ini and other settings files over from the folder on your server where the WordSmith program is, to that folder. Your life as an installer will be a lot easier if the drive and folder you specify is truly one your users can write to!

For networked drives, because of a Microsoft security update involving HTML Help files, WordSmith will copy the wordsmith6.chm file to the user's Windows-allocated temporary folder.

See also: What your licence allows, Class Instructions

4.4 version checking

You can check whether your version is up to date in WordSmith's main settings, and can set the check to occur regularly every month, week etc.

Besides this, WordSmith comes with a utility (wordsmith_version_check.exe) which enables you to download the necessary upgrades and patches. In order to install these, WordSmith itself will need to close down.

See also: version information, version updating.
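Returning to the [NETWORK] section of wordsmith6.ini described under site licence defaults (4.3): an installer who wants to check the four entries before rolling out a site licence could read them with a short script. The following is only a sketch in Python. It assumes the section looks exactly like the example in 4.3 (without the explanatory remarks in brackets). The function name read_network_section and the dictionary keys it returns are invented for illustration; they are not part of WordSmith.

```python
import configparser

# Sample text modelled on the [NETWORK] example in section 4.3.
SAMPLE_INI = r"""
[NETWORK]
network-read/write folder=m:\Documents\wsmith6
prohibited drives=xyz
limited folder=v:\texts
instructions folder=L:\English\WSmith instructions
"""


def read_network_section(ini_text):
    """Return the four site-licence entries as a plain dictionary."""
    # Only "=" separates key and value, so the ":" in drive letters is kept.
    # Interpolation is switched off so that "%" in a path is left alone.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(ini_text)
    section = parser["NETWORK"]
    return {
        "user_folder": section.get("network-read/write folder"),
        # "xyz" is read as the three drives X:, Y: and Z:
        "prohibited_drives": [d.upper() + ":" for d in section.get("prohibited drives", "")],
        "limited_folder": section.get("limited folder"),
        "instructions_folder": section.get("instructions folder"),
    }


if __name__ == "__main__":
    for key, value in read_network_section(SAMPLE_INI).items():
        print(key, "=", value)
```

A real wordsmith6.ini contains other sections too; this sketch reads only [NETWORK] and ignores the rest.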

Controller Section V

5 Controller

The main WordSmith Controller is a window which holds all the numerous settings and behind the scenes tells each Tool what to do. You can start up only one Controller -- though you can start up numerous Concord windows and WordList windows etc.

It is best to leave the Controller in one default position on your screen -- there is no advantage in maximizing its size.

5.1 characters and letters

5.1.1 accents and other characters

This window shows a set of the characters available using Unicode and, below, the official name of the character selected. Selecting a character puts it into the clipboard ready to paste.

See also: Copying a character into Concord

5.1.2 wildcards

Many WordSmith functions allow you a choice of wildcards:

symbol   meaning                                                      examples
*        disregard the end of the word, disregard a whole word        tele*   *ness   *happi*   book * hotel
?        any single character (including punctuation) will match here Engl???   ?50.00
^        a single letter                                              Engl^^^
#        any sequence of numbers, 0 to 9                              $#   £#.00

(To represent a genuine #, ^, ? or *, put each one in double quotes, eg. "?" "#" "^" "*".)

5.2 add notes

As WordSmith generates data, it will state the current relevant settings in the Notes tab and these are saved with your data. In this sample case the original work was done in 2008. In 2009, mutual information was computed on that data, with certain specific settings.

You may add to these notes, of course. For example, if you have done a concordance and sorted it carefully using your own user-defined categories, you will probably want to list these and save the information for later use.

If you need access to these notes outside WordSmith Tools, select the text using Shift and the cursor arrows or the mouse, then copy it to the clipboard using Ctrl+C and paste into a word processor or a text editor such as Notepad.

5.3 adjust settings

There are a number of Settings windows in the Controller. You will see tabs accessing them at the left in the main Controller window.

Choose and save settings concerning:
· font
· colours
· folders
· tags
· general settings
· match-lists
· stop lists
· lemma lists
· text and language settings
· Concord Settings
· KeyWords settings
· WordList settings
· advanced user specific settings
· index file settings
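The wildcard symbols listed in section 5.1.2 can be thought of as a small pattern language. The Python sketch below shows one possible reading of that table by translating a wildcard pattern into a regular expression. It is not WordSmith's own implementation. The function name wildcard_to_regex is invented, and treating a free-standing * as exactly one whole word and matching without regard to case are assumptions.

```python
import re

# One letter of any alphabet: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (*, ?, ^, #) into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        # "?" "#" "^" "*" in double quotes stand for the genuine character.
        if (ch == '"' and i + 2 < len(pattern)
                and pattern[i + 2] == '"' and pattern[i + 1] in "#^?*"):
            parts.append(re.escape(pattern[i + 1]))
            i += 3
            continue
        if ch == "*":
            # Assumption: a * standing alone between spaces is one whole word;
            # a * attached to other characters is the rest of that word.
            alone = ((i == 0 or pattern[i - 1] == " ")
                     and (i + 1 == len(pattern) or pattern[i + 1] == " "))
            parts.append(r"\S+" if alone else r"\S*")
        elif ch == "?":
            parts.append(".")        # any single character, punctuation included
        elif ch == "^":
            parts.append(LETTER)     # a single letter
        elif ch == "#":
            parts.append(r"\d+")     # any sequence of digits 0 to 9
        elif ch == " ":
            parts.append(r"\s+")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    for pat, text in [("tele*", "telephone"), ("book * hotel", "book a hotel"),
                      ("Engl^^^", "English"), ("£#.00", "£250.00")]:
        print(pat, text, bool(wildcard_to_regex(pat).fullmatch(text)))
```

Using fullmatch means the whole word or phrase has to fit the pattern, which is why Engl^^^ accepts English but not Engl123.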

5.4 advanced settings

These are reached by clicking the Advanced Settings button visible in the Main settings page, and open up a whole new set of options:

Tags & Markup
Lists
Index
Scripts

Help, logging

Help system access
On a network, it is commonly the case that Microsoft protects users to such an extent that the usual .CHM help files show only their table of contents but no details. Here you can set the WordSmith help to access the local CHM file or the online Help at the WordSmith URL.

Logging
Logging is useful if you are getting strange results and wish to see details of how they were obtained. If this is enabled, WordSmith will save some idea of how your results are progressing in the log-file, which you see in the Advanced Settings | Help | Logging section in the Controller. Here you can optionally switch on or off logging and choose an appropriate file-name. If you switch it on at any time you will get a chance to clear the previous log-file.

This log shows WordSmith working with the Aligner, at the stage where various languages are being loaded up.

And here in a Concord process we see some details of the text files being read and processed, seeking the search-word horrible:

The most straightforward way to use logging is
1. Find logging in the Help tab of Advanced settings.
2. Click the Activated box. You'll be asked whether you want any previous log cleared.
3. Carry on using WordSmith as desired, changing settings or using Concord or any other tool.

From time to time or after WordSmith finishes, press the Refresh button visible above and read the output. It is a text file so it can be opened using any word processing software. If you have had trouble, looking at the last few lines may help by showing where processing stopped.

If you want to log as WordSmith starts up, start it from the command line with the parameter /log:

Start | Run | Cmd <Enter> | cd\wsmith6 <Enter> | wordsmith6 /log <Enter>

(or wordsmith6 /log C:\temp\WSLog.txt to force use of C:\temp\WSLog.txt. If you do that, make sure the folder exists first.)
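Starting WordSmith with logging can also be scripted, which takes care of the "make sure the folder exists first" step. What follows is a minimal Python sketch, not a supplied utility. It assumes the default \wsmith6 install folder and that /log and the log file name are passed as two separate arguments, as in the command line above. The helper names build_log_command and start_with_logging are invented for illustration.

```python
import subprocess
from pathlib import Path


def build_log_command(install_dir, log_path=None):
    """Build the command line for starting WordSmith with logging on.

    If a log file is named, its folder is created first when missing.
    """
    command = [str(Path(install_dir) / "wordsmith6.exe"), "/log"]
    if log_path is not None:
        Path(log_path).parent.mkdir(parents=True, exist_ok=True)
        command.append(str(log_path))
    return command


def start_with_logging(install_dir=r"C:\wsmith6", log_path=r"C:\temp\WSLog.txt"):
    """Launch WordSmith (Windows only); returns the process handle."""
    return subprocess.Popen(build_log_command(install_dir, log_path),
                            cwd=install_dir)
```

Calling start_with_logging() with no arguments corresponds to typing wordsmith6 /log C:\temp\WSLog.txt in the \wsmith6 folder.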

See also: emailed error reports

Text Dates
Text dates can be set to varying levels of delicacy, depending on the range of text file dates chosen.

See also: using text dates

Advanced section (menus, clipboard, deadkeys etc.)

Customising menus
You can re-assign new shortcuts (such as Alt+F3, Ctrl+O) to the menu items which are used in the various Tools. And all grids of data have a "popup menu" which appears when you click the right button of your mouse. To customise this, in the main WordSmith Controller program, choose Main Settings | Advanced | Menus.

You will see a list of menu options at the left, and can add to (or remove from) the list on the right by selecting one on the left and pressing the buttons in the middle, or by dragging it to the right. To re-order the choices, press the up or down arrow. In the screenshot I've added "Concordance" as I usually want to generate concordances from word-lists and key word lists.

Whatever is in your popup menu will also appear in the Toolbar.

Below, you see a list of Shortcuts, with Ctrl+M selected. To change a shortcut, drag it up to the Customised menu list or the popup menus and toolbars list.

The Restore defaults button puts all shortcuts back to factory settings. To save the choices permanently, see Saving Defaults.

Other

Here you may press a button to restore all factory defaults, useful if your settings are giving trouble.

prompt to save (in general): reminds you to save every time new data results are computed or re-organised.

prompt to save concordances computed from other Tools: (default=false) prompt after WordList or KeyWords or WSConcGram gets a concordance computed.

require precomposed characters: some languages have a lot of cases where two characters get merged in the display into one, e.g. e with ` appearing as è. WordSmith will automatically check for such pairs when processing languages such as Amharic, Arabic, Bengali, Farsi, Gujarati, Hindi, Kannada, Khmer, Lao, Malayalam, Nepali, Oriya, Thai, Tibetan, Telugu, Tamil, Yoruba. If you want to force WordSmith to carry out a test for such pairs when processing all languages, however, check this box.

Clipboard
Here you may choose defaults for copying.

The number of characters only applies when copying as editable text.

See also: clipboard

User .dll
If you have a DLL which you want to use to intercept WordSmith's results, you can choose it here. The one this user is choosing, WordSmithCustomDLL.dll, is supplied with your installation and can be used when you wish.

If "Filter in Concord" is checked, this .dll will append all concordance lines found in plain text to a file called Concord_user_dll_concordance_lines.txt in your \wsmith6 folder, if there is space on the hard disk.

Language Input
Deadkeys are used to help type accented characters with some keyboards. The language input tab lets you alter the deadkeys to suit your keyboard and if necessary force WordSmith to use the keyboard layout of your choice whenever WordSmith starts up.

Here the user's Windows has four keyboard layouts installed. To type in Maori, you might choose to select Maori, and change a couple of deadkeys. At present, as the list shows, pressing ` then A gives À, but users of Maori usually prefer that combination to give Ā.

To change these settings,
1. select the line
2. edit the box below (you can drag the character you need from the character window), then press Change.

When you've changed all the characters you want, press Save.

If you want WordSmith to force the keyboard to Maori too every time it starts (this will probably be necessary if it is not a New Zealand computer) then check the always use selected keyboard box.

Text Conversion
If your text files happen to contain UTF-8 text, WordSmith will notice and may offer to convert them on the spot using the options below.

See also: menu and button options.

5.5 batch processing

The point of it...
Batch processing is used when you want to make separate lists, but you don't want the trouble of doing it one by one, manually selecting each text file, making the word list or concordance, saving it, and so on.

If you have selected more than one text file you can ask WordList, Concord and KeyWords to process as a batch.

Folder where they end up
The name suggested is today's date. Edit it if you like. Whatever you choose will get created when the batch process starts. The results will be stored in folders stemming from the folder name. That is, if you start making word lists in c:\wsmith\wordlist\05_07_19_12_00, they will end up like this:

c:\wsmith\wordlist\05_07_19_12_00\0\fred1.lst
c:\wsmith\wordlist\05_07_19_12_00\0\jim2.lst
..
c:\wsmith\wordlist\05_07_19_12_00\0\mary512.lst
then
c:\wsmith\wordlist\05_07_19_12_00\1\joanna513.lst
etc.

Filenames will be the source text filename with the standard extension (.lst, .cnc, .kws).

Zip them
If checked, the results are physically stored in a standard .zip file. You can extract them using your standard zipping tool such as Winzip, or you can let WordSmith do it for you. The files within are exactly the same as the uncompressed versions but save disk space -- and the disk system will also be less unhappy than if there are many hundreds of files in the same folder.

If you zip them, you will get c:\wsmith\wordlist\05_07_19_12_00\batch.zip and all the sub-files will get deleted unless you check "keep both .zip and results".

One file / One file per folder?
The first alternative (default) makes one .zip file with all your individual word-lists in it. Each word-list or concordance or keywords list is for one source text.

But what if your text files are structured like this:

\..\BNC
\..\BNC\written
\..\BNC\written\humanities
\..\BNC\written\medicine
\..\BNC\written\science
\..\BNC\spoken
etc.

The One file per folder, individual zipfiles option makes a separate .zip of each separate folderful of textfiles (eg. one for humanities, another for medicine, etc.), with one list for each source text.

The One file per folder, amalgamated zipfiles option makes a separate .zip of each folderful, but makes one word-list or concordance from that whole folderful of texts.

Batch Processing and Excel
These options may also offer a chance for data to be copied automatically to an Excel file.

Faster (Minimal) Processing
This checkbox is only enabled if you are about to start a process where more than one kind of result can be computed simultaneously. For example, if you are computing a concordance, by default collocates, patterns and dispersion plots will be computed when each concordance is done. In KeyWords, likewise, there will be dispersion plots, link calculations etc. which will be computed as the KWs are calculated.

If checked, only the minimal computation will be done (KWs in KeyWords processing, concordance in Concord). This will be faster, and you can always get the plots computed later as long as the source texts don't get moved or deleted.

Example: you're making word lists and have chosen 1,200 text files which are from a magazine called "The Elephant". You specify C:\WSMITH\WORDLIST\ELEPHANT as your folder name. If you already have a folder called C:\WSMITH\WORDLIST\ELEPHANT, you will be asked for permission to erase it and all sub-folders of it! After you press OK, 1,200 new word-lists are created, called trunk.LST, tail.LST .. digestive system.LST. They are all in numbered sub-folders of a folder called C:\WSMITH\WORDLIST\ELEPHANT.

If you did not check "zip them into 1 .zip file", you will find them under C:\WSMITH\WORDLIST\ELEPHANT\0.

If you did check "zip them into 1 .zip file", there is now a C:\WSMITH\WORDLIST\ELEPHANT.ZIP file which contains all your results. (The 1,200 .LST files created will have been erased but the .ZIP file contains all your lists.)

The advantage of a .zip file is that it takes up much less disk space and is easy to email to others. WordSmith can access the results from within a .zip file, letting you choose which word list, concordance etc. you want to see.

Getting at the results in WordSmith
Choose File | Open as usual, then change the file-type to "Batch file *.zip". When you choose a .zip file, you will see a window listing its contents. Double-click on any one to open it. Note: of course Concord will only succeed in opening a concordance and KeyWords a key word list file. If you choose a .zip file which contains something else, it will give an error message.

See also: batch scripts

5.6 choosing texts

This chapter explains how to select texts, save a selection and even attach a date going back as far as 4000BC to each text file.

You need text in a suitable format.

5.6.1 text formats

In WordSmith you need plain text files, such as you get if you save a Word .doc as Plain Text (.txt). The text format should be ASCII or ANSI or Unicode (UTF16).

Any Word .doc or .docx files will look crossed out and should not be used: convert them to .txt first.

Text files within .zip files can be used; they will be coloured red in the Files available display but WordSmith can read them and get the texts you select within them.

Why not .PDF files?
Don't choose .pdfs either, they have a very special format. Essentially a PDF is a set of instructions telling a printer or browser where to place coloured dots. The plain text is usually hard to extract even if you use Adobe Acrobat (and Adobe invented the format).

Why not .DOC files?
A .DOC is rather unsuitable even if it does contain the text: this is what a .DOC containing only the word hello looks like in Word, then opened up in Notepad, then the .PDF of the same.

Check the format is OK
In the file-choose window you can test the format of the texts you've chosen with the Test File Format button.

5.6.2 the file-choose window

How to get here
This function is accessed from the File menu in the Controller and the Settings menu or New menu item in the various Tools.

The two main areas at left and right are
· files to choose from (at left)
· files already selected (at right)

The button which the red arrow points at is what you press to move any you have selected at the left to your "files selected" at the right. Or just drag them from the left to the right.

The list on the right shows full file details: name, date, size, number of words (shown above with ?? as WordSmith doesn't yet know, though it will after you have concordanced or made a word list) and whether the text is in Unicode (? for the same reason). To the right of Unicode is a column stating whether each text file meets your requirements.

If you have never used WordSmith before (more precisely if you have not yet saved any concordances, word lists etc.) you will find that a chapter from Charles Dickens' A Tale of Two Cities has been selected for you. To stop this happening, make sure that you do save at least one word list or concordance! See also -- previous lists.

This puts the current file selection into store. All files of the type you've specified in any sub-folders will also get selected if the "Sub-folders too" checkbox is checked. You can check on which ones have been selected under All Current Settings.

Clear
As its name suggests, this allows you to change your mind and start afresh. If any selected filenames are highlighted, only these will be cleared.

More details

File Types
The default file specification is *.* (i.e. any file) but this can be altered in the box or set permanently in wordsmith6.ini.

Tool
In the screenshot above you can see we are choosing texts for Concord. There are alternatives available (WordList, KeyWords etc.).

Select All
Selects all the files in the current folder.

Drives and Folders
Double-click on a folder to enter it. You can re-visit a folder if its name is in the folder window history list, and easily go back with the standard Windows "back" button. Or click on the button to choose a new drive or folder.

Sub-Folders
If checked, when you select a whole driveful or a whole folderful of texts at the left, you will select it plus any files in any sub-folders of that drive or folder.

Sorting
By clicking on the column headers (Folder, Filename, Size, Type, Words, Unicode, Date etc.) you can re-sort the listing.

Test text format
This button checks the format of any files selected. In the screenshot above, no tests have been done so the display shows ? for each file. If the text file is in Unicode, the display shows U; if Unicode big-endian, UB; if plain ASCII or Ansi text, it will show A; if it's a Word .doc file, it'll show D. If it is in UTF-8, it shows 8. If you get inconsistency you'll be invited to convert them all to Unicode.

Favourites
Two buttons on the right allow you to save or get a previous file selection, saving you the trouble of making and remembering a complex set of choices.

Type of text files
In WordSmith you need plain text files, such as you get if you save a Word .doc as Plain Text (.txt). Any Word .doc files or .pdfs will look crossed out and should not be used: convert them to .txt first.

(10words.TXT is greyed out because it has the hidden attribute)

Setting text file dates
You can edit the textual date to be attached to any text file within any date range from 4000BC upwards. (On first reading from disk the date will be set to the date that text file was last edited.)

How to do it

Press the button circled in this screen-shot:

A window opens up letting you set text file dates and times.

Here below you will see Shakespeare plays with their dates being edited. Delicacy offers a choice of various time ranges (centuries, years, etc.) which will help ignore excessive detail. If years are chosen as above, month, day and hour of editing are no longer relevant and default to 1st July at 12:00.

If you choose a suitable text file and press the Auto-date button, each of your chosen text files will be updated if its file-name and a suitable date are found in the list. The format of the list is the filename followed by the date (formatted YYYY or YYYY/MM/DD for year, month and day).

Examples:
A0X 1991
B03 1992/04/17

Here we see BNC text files sorted by date. The ones at the top had no date, then the first dated was KNA.XML (a spoken sermon) dated as 1901, which is when the header says the tape-recording was made(!).

Your %USER_FOLDER% folder includes an auto-date file for the BNC (BNC dates) and another for the Shakespeare corpus (Shakespeare plays dated plain).

There is also a utility in the File Utilities which can parse text files to generate dates using your own syntax, preparing a text file like this to read in.

You can save the dates and files as favourites so as to re-use this information as often as you like.

See also: using text dates

Advanced
Opens a toolbar showing some further buttons:

The buttons at the top left let you see the files available as icons, as a list, or with full details (the default) instead.

Random
This re-orders the files (on both sides) in random order.

View in Notepad
Lets you see the text contents in the standard Windows simple word-processor for text files, Notepad.

Get from Internet
Allows you to access WebGetter so as to download text from the Internet.

Check
Checks whether the files selected are available to read (e.g. after loading up Favourites).

Save List
Lets you save any already stored text files as a plain text list (e.g. for adding date information).

Zip files
If checked, when loading up a whole folder of text files, WordSmith will automatically include ones from .zip files. Whether checked or not, if you double-click on a zip file you can enter that as if it were a folder and see the contents. Zip files will be coloured red.

In this screen, the historical plays of Shakespeare within a zip file (plays.zip) have been selected.

See also: Step-by-step online example, Viewing source texts, Finding source texts.

5.6.3 favourite texts

save favourites
Used to save your current selection of texts. Useful if it's complex, e.g. involving several different folders. Essential if you've attached a date to your text files.

Saves a list of text files whose status is either unknown or known to meet your requirements when selecting files by their contents, ignoring any which do not.

get favourites
Used to read a previously-saved selection from disk.

By default the file name will be the name of the tool you're choosing texts for plus recent_chosen_text_files.dat, in your main WordSmith folder.

You may use a plain text file for loading a set of choices you have edited using Notepad, but note that each file needed must be fully specified: wildcards are not used and a full drive:\folder path is needed. You may date the text file if you like by appending to the file-name a separator character followed by the date (any date after 1000BC) in the format yyyy/mm/dd e.g.

c:\text\socrates.txt -399/07/01
c:\text\hamlet.txt 1600/07/01
c:\text\second world war.txt 1943/05/22

See also: Choosing Texts, file dates

5.7 choosing files from standard dialogue box

The dialogue box here is very similar to the one used for choosing text files; it also allows you to choose from a zip file.

You can use Viewer & Aligner to examine a file: this makes no sense in the case of a word list, key word list, or concordance, but may be useful if you need to examine a related text file, e.g. a readme.txt in the same zip file as your concordance or word lists.

To choose more than one file, hold the Control key down as you click with your mouse, to select as many separate files as you want. Or hold down the Shift key to select a whole range of them.

5.8 class or session instructions

When WordSmith is run in a training session, you may want to make certain instructions available to your trainees.

To do this, all you need to do is ensure there is a file called teacher.rtf in your main \wsmith6 folder where the WordSmith programs are or in the "instructions folder" explained under site licence defaults. If one is found, it will be shown automatically when WordSmith starts up. To stop it being shown, just rename it! You edit the file using any Rich Text Format word processor, such as MS Word™, saving as an .rtf file.

See also: Site licence defaults

5.9 colour categories

The point of it ...
With a concordance or word list on your screen it can be hard for example to know how many of the thousands of entries met certain criteria. For example which ones derived from only a few texts? Which ones ended in -NESS? How many of the concordance lines came both from mytext.txt and from the first 40 words in the sentence, and which ones are they?

The idea is to let you re-sort your existing data by your own criteria. (Since last millennium WordSmith lists have been sortable by standard criteria, and there has long been a Set column for your own classification, but this feature makes it possible to have multiple and complex sorts.)

How to do it
The menu option Compute | Colour categories will be found if the data have a Set column. The menu option brings up a window where you specify your search criteria. Here is an example:

Complete the form by choosing a data column (above the user chose the File column) and a condition (here X ends with Y, with .txt entered below, which will mean 'search the File column seeking any where the File ends in .txt'). Then choose a colour (here colour 67 was chosen) and then press Add a search. Finally, press Find.

As you can just see, the Set column in the concordance has some items coloured.

A more complex example:

where the user wants to process the Word column of data, looking for a condition where the word starts with UN and occurs at least 5 times. For any word which meets this condition, the Set column will show the colour selected.

When you have specified the criteria, press the Find button.
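The logic of such a two-part condition can be pictured with a few lines of Python. This is purely an illustration of "starts with UN and occurs at least 5 times". The word frequencies, the colour name and the function name categorise are all made up for the example and do not come from WordSmith.

```python
# Hypothetical word list: word -> frequency.
WORD_LIST = {"UNABLE": 12, "UNDER": 40, "UNZIP": 2, "HAPPINESS": 7}


def categorise(word_list, prefix, min_freq, colour):
    """Return {word: colour} for every word meeting both conditions."""
    return {
        word: colour
        for word, freq in word_list.items()
        if word.upper().startswith(prefix.upper()) and freq >= min_freq
    }


if __name__ == "__main__":
    print(categorise(WORD_LIST, "UN", 5, "yellow"))
```

UNZIP starts with UN but occurs only twice, and HAPPINESS is frequent enough but does not start with UN, so neither is coloured.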

The top of the Colour Categories window shows the percentage results. In the example below, the user has decided to omit their first search and to carry out another on the same word-list which found 188 words ending in NESS which were present in more than 40 BNC texts.

But not
This option lets you have a negative condition.

Where are they in the list?
To locate the items which colour categorising has found, simply sort the Set column. (If it's a Freq. list you may have to go to the Alphabetical tab first.) The categorised items float to the top. Here, the 6 words between BE and BF with frequency above 5 are coloured green at the top of the word list, with the 13 NESS items with frequency less than or equal to 5 coloured blue.

Once sorted, the data can be saved as before.

What if I already have a Set classification?
Here is a concordance where the exclamation O or Ah had already been identified and marked in the Set column.

As the Set column is already in use, classifying further by colour will take second priority to the existing forms typed in. So in this case, where 58 cases were found where the exclamations came in the first 49% of the text, we see that line 10 goes green (line 11 did not go green because the criterion was less than 50 and it had exactly 50%), but clicking the Set column gives priority to the exclamation typed in.

In this case the Os follow the Ahs, and the coloured Ahs follow the uncoloured ones.

Removing the colours?
Use the Clear colours button.

What if more than one condition is met?
If you colour words ending NESS blue, and also colour words starting UN yellow, any word meeting both conditions will get a mixture of the two colours as shown here:

See also: setting categories by typing, colour categories for concordances

5.10 colours

Found in main Settings menu in all Tools and Main Settings in the Controller. Enables you to choose your default colours for all the Tools. Available colours can be set for

plain text                      this is the default colour
highlighted text                as above when selected
mark-up                         tags
search word                     concordance search word; words in (key) word lists
main sort word                  indicates first sort preference; used for % data in (key) word lists
second sort word                indicates first tie-breaker sort colour
context word                    context word
deleted words                   any line of deleted data
not numbered line               any line which has not been user-sorted
search word highlighted         concordance search word when selected
main sort word highlighted      first sort when selected
second sort word highlighted    first tie-breaker sort when selected
context word highlighted        context word when selected
most frequent collocate         most frequent collocate or detailed consistency word, p value in keywords
viewing texts                   in the text viewer
lemma colour                    colour of lemmas shown in lemma window
word-cloud shape                see word clouds section below
word-cloud window
word-cloud word

Overall colour scheme
This allows a range of colour scheme choices, which will affect the colours of all WordSmith windows.

List colours
To alter colours, first click on the wording you wish to change (you'll see a difference in the left margin: here search word has been chosen), then click on a colour in the colour box. The Foreground and Background radio buttons determine whether you're changing foreground or background colours. Press the Reset button if you want to restore colours to the factory defaults.

The same colours, or equivalent shades of grey, will appear in printouts. Alternatively you can set the printer to black and white, in which case any column not using the "plain text" colour will appear in italics (or bold or underlined if you have already set the column to italics).

Ruler
This opens another dialogue window, in which you can set colours and plot divisions for the ruler:
[Screenshot: ruler settings dialogue]

Word Clouds
These settings allow you to choose how each word will be displayed, e.g. within rectangles or circles. The colours of the words and the word cloud window are set in the List colours section above.

See also: Column Layout for changing the individual colours of each column of data, Colour Categories, Word Clouds.

5.11 column totals

The point of it...
This function allows you to see a total and basic statistics on each column of data, if the data are numerical.

How to do it
With a word-list, concordance or key-words list visible, choose the menu item View | Column Totals to switch column totals on or off.
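As a rough illustration of what a column total involves, the Python sketch below sums each numerical column of a small, made-up word list and skips any column that is not numerical. The data, the function name column_totals and the choice of statistics (total, mean, minimum, maximum) are assumptions made for illustration; this section does not say which statistics WordSmith itself reports.

```python
from statistics import mean

# Hypothetical word-list rows: word, frequency, number of texts.
headings = ("Word", "Freq.", "Texts")
rows = [
    ("THE", 120, 35),
    ("A", 80, 35),
    ("ACT", 20, 30),
]

def column_totals(rows, headings):
    """Return {heading: statistics} for every column whose values are all numbers."""
    totals = {}
    for index, heading in enumerate(headings):
        values = [row[index] for row in rows]
        # Only numerical columns get totals; the Word column is skipped.
        if all(isinstance(value, (int, float)) for value in values):
            totals[heading] = {
                "total": sum(values),
                "mean": mean(values),
                "min": min(values),
                "max": max(values),
            }
    return totals

totals = column_totals(rows, headings)
```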

[Screenshot: column totals on a detailed consistency list]
Here we see column totals on a detailed consistency list based on Shakespeare's plays. The list itself is sorted by the Texts column: the top items are found in all 35 of the plays used for the list. In the case of Anthony and Cleopatra, A represents 1.28% of the words in that column, that is 1.28% of the words of the play Anthony and Cleopatra. In the case of ACT this is the highest percentage in its row (this word is used more in percentage terms in that play than in the others).

See also: save as Excel.

5.12 compute new column of data

The point of it...
This function brings up a calculator, where you can choose functions to calculate values which interest you. For example, a word list routinely provides the frequency of each type, and that frequency as a percentage of the overall text tokens. You might want to insert a further column showing the frequency as a percentage of the number of word types, or a column showing the frequency as a percentage of the number of text files from which the word list was created.
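The two extra columns suggested above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of WordSmith's calculator: the word list, the number of text files and all names are made up.

```python
# Hypothetical word list: (word, frequency). All values are invented.
word_list = [("THE", 50), ("OF", 30), ("DREAM", 20)]
text_files = 4  # assumed number of text files behind the word list

tokens = sum(freq for _, freq in word_list)  # running words (100 here)
types = len(word_list)                       # distinct words (3 here)

new_columns = []
for word, freq in word_list:
    new_columns.append({
        "word": word,
        "freq": freq,
        "pct_of_tokens": 100 * freq / tokens,     # the routine % column
        "pct_of_types": 100 * freq / types,       # frequency as % of word types
        "pct_of_files": 100 * freq / text_files,  # frequency as % of text files
    })
```

Each new value is computed from data in the same row plus one fixed figure (tokens, types or files), which is the pattern the calculator's relative and absolute access supports.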

[Screenshot: word-list with a cumulative column]
This word-list has a column which computes the cumulative scores (running total of the % column).

How to do it
Just press Compute | New Column and create your own formula. You'll see standard calculator buttons with the numbers 0 to 9, decimal point, brackets and the 4 basic functions. To the right there's a list of standard mathematical functions to use (pi, square root etc.): to access these, double-click on them. Below that you will see access to your own data in the current list, listing any number-based column-headings. You can drag or double-click them too.
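The cumulative column described above is a running total of the % column. A minimal Python sketch of that calculation, using invented percentages, might look like this:

```python
from itertools import accumulate

# Hypothetical % column of a word list, most frequent word first.
percentages = [5.0, 3.0, 2.0, 1.5]

# Each entry is the sum of its own row and all the rows above it.
cumulative = list(accumulate(percentages))
# cumulative is [5.0, 8.0, 10.0, 11.5]
```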

Absolute and Relative
Your own data can be accessed in two ways. Relative access (the default) means that, as in a spreadsheet, you want the new column to access data from another column but in the same row. Absolute access means accessing a fixed column and row.

Examples

You type, and for each row in your data the new column will contain the result shown:

  Rel(2) ÷ 5
      the data from column 2 of the same row, divided by 5

  RelC(2)
      the data from column 2 of the same row, added to a running total

  Rel(3) + (Rel(2) ÷ 5)
      the data from column 2 of the same row, divided by 5, added to the data from column 3 of the same row

  Abs(2;1) ÷ 5
      the data from column 2 of row 1, divided by 5. (This example is just to illustrate; it would be silly as it would give the exact same result in every row.)

  Rel(2) ÷ Abs(2;1) × 100
      the data from column 2 of the same row divided by column 2 of row 1 and multiplied by 100. This would give each value in column 2 as a percentage of the top result in column 2. For the first row it'd give 100%, but as the frequencies declined so would their

percentage of the most frequent item.

You can format (or even delete) any variables computed in this way: see layout.

See also: count data frequencies, colour categories, column totals.

5.13 copy your results

The quickest and easiest method of copying your data, e.g. into your word processor, is to select with the cursor arrows and then press Ctrl+C. This puts it into the clipboard (see the clipboard section for examples showing how to copy into Word etc.).

If you choose File | Save As you get various choices:
  saving as a text file or XML or spreadsheet
  save as data (not the same as saving as text: this is saving so you can access your data again another day)

See also: saving, clipboard, printing.

5.14 count data frequencies

In various Tools you may wish to further analyse your data. For example with a concordance you may want to know how many of the lines contain a prefix like un- or how many items in a word-list end in -ly. To do this, choose Summary Statistics in the Compute menu.

Load
This allows you to load into the searches window any plain text file which you have prepared previously. For complex searching this can save much typing. An example might be a list of suffixes or prefixes to check against a word list.

Search Column
This lets you choose which column of data to count in. It will default to the last column clicked for your data.

Breakdown by
If activated this lets you break down results, for example by text file. See the example from Concord.

Cumulative Column

This adds up values from another column of data. See the example from WordList.

See also: distinguishing consequence from consequences, frequencies of suffixes in a word list, compute new column of data.

5.15 custom processing

This feature, which like the API is not for those without a tame programmer to help, is found under Main Settings | Advanced.

The point of it...
I cannot know which criteria you have in processing your texts, other than the criteria already set up (the choice of texts, of search-word, etc.). You might need to do some specialised checks or alteration of data before it enters the WordSmith formats. For example, you might need to lemmatise a word according to the special requirements of your language.

This function makes that possible. If for example you have chosen to filter concordances, as Concord processes your text files, every time it finds a match for your search-word it will call your .dll file. It'll tell your own .dll what it has found, and give it a chance to alter the result or tell Concord to ignore this one.

How to do it...
Choose your .dll file (it can have any filename you've chosen for it) and check one or more of the options in the Advanced page.

You will need to call standard functions and need to know their names and formats. It is up to you to write your own .dll program which can do the job you want. This can be written in any programming language (C++, Java, Pascal, etc.).

An example for lemmatising a word in WordList
The following DLL is supplied with your installation, compiled & ready to run.

Your .dll needs to contain a function with the following specifications:

    function WordlistChangeWord(
      original : pointer;
      language_identifier : DWORD;
      is_Unicode : WordBool) : pointer; stdcall;

The language_identifier is a number corresponding to the language you're working with. See List of Locale ID (LCID) Values as Assigned by Microsoft.

So the "original" (sent by WordSmith) can be a PCHAR (7 or 8-bit) or a PWIDECHAR (16-bit Unicode), and the result which your .dll supplies can point to
a) nil (if you simply do not want the original word in your list)
b) the same PCHAR/PWIDECHAR if it is not to be changed at all

c) a replacement form

Here's an example where the source text was Today is Easter Day.
[Screenshot: word list in which EASTER has been replaced by CHRISTMAS]

Source code
The source code for the .dll in Delphi is this:

    library WS5WordSmithCustomDLL;

    uses
      Windows, SysUtils;

    { This example uses a very straightforward Windows routine for comparing
      strings, CompareStringA and CompareStringW, which are in a Windows .dll.
      The function does a case-insensitive comparison because
      NORM_IGNORECASE (=1) is used. If it was replaced by 0, the comparison
      would be case-sensitive.
      In this example, EASTER gets changed to CHRISTMAS. }

    function WordlistChangeWord(
      original : pointer;
      language_identifier : DWORD;
      is_Unicode : WordBool) : pointer; stdcall;
    begin
      Result := original;
      if is_Unicode then begin
        if CompareStringW(
          language_identifier,
          NORM_IGNORECASE,
          PWideChar(original), -1,
          PWideChar(widestring('EASTER')), -1) - 2 = 0 then
          Result := pwidechar(widestring('CHRISTMAS'));
      end else begin
        if CompareStringA(
          language_identifier,
          NORM_IGNORECASE,
          PAnsiChar(original), -1,
          PAnsiChar('EASTER'), -1) - 2 = 0 then
          Result := pAnsichar('CHRISTMAS');
      end;
    end;

    function ConcordChangeWord(
      original : pointer;
      language_identifier : DWORD;
      is_Unicode : WordBool) : pointer; stdcall;
    begin
      Result := WordlistChangeWord(original,language_identifier,is_unicode);
    end;

    function KeyWordsChangeWord(
      original : pointer;
      language_identifier : DWORD;
      is_Unicode : WordBool) : pointer; stdcall;
    begin
      Result := WordlistChangeWord(original,language_identifier,is_unicode);
    end;

    { This routine exports each concordance line together with
        the filename it was found in
        a number stating how many bytes into the source text file the entry
          was found
        its hit position in that text file counted in characters (not bytes)
        and the length of the hit-word (so if the search was on HAPP* and
          the hit was HAPPINESS this would be 9)
      This information is saved in Unicode appended to your results_filename }

    function HandleConcordanceLine(
      source_line : pointer;
      hit_pos_in_characters, hit_length : integer;
      byte_position_in_file, language_id : DWORD;
      is_Unicode : WordBool;
      source_text_filename, results_filename : pwidechar) : pointer; stdcall;

      function extrasA : ansistring;
      begin
        Result := #9+
          ansistring(widestring(pwidechar(source_text_filename)))+
          #9+ ansistring(IntToStr(byte_position_in_file))+
          #9+ ansistring(IntToStr(hit_pos_in_characters))+
          #9+ ansistring(IntToStr(hit_length));
      end;

      function extrasW : widestring;
      begin
        Result := #9+
          widestring(pwidechar(source_text_filename))+
          #9+ IntToStr(byte_position_in_file)+
          #9+ IntToStr(hit_pos_in_characters)+
          #9+ IntToStr(hit_length);
      end;

    const
      bm: char = widechar($FEFF);
    var
      f : File of widechar;
      output_string : widestring;
    begin
      Result := source_line;
      if length(results_filename)>0 then
      try
        AssignFile(f,results_filename);
        if FileExists(results_filename) then begin
          Reset(f);
          Seek(f, FileSize(f));
        end else begin
          Rewrite(f);
          Write(f, bm);
        end;
        if is_Unicode then
          output_string := pwidechar(source_line)+extrasW
        else
          output_string := pAnsichar(source_line)+widestring(extrasA);
        if length(output_string) > 0 then
          BlockWrite(f, output_string[1], length(output_string));
        CloseFile(f);
      except
      end;
    end;

    exports
      ConcordChangeWord,
      KeyWordsChangeWord,
      WordlistChangeWord,
      HandleConcordanceLine;

    begin
    end.

See also: API, custom settings.

5.16 editing

5.16.1 reduce data to n entries

With a very large word-list, concordance etc., you may wish to reduce it randomly (e.g. for sampling). This menu option (Edit | Deleting | Reduce to N) allows you to specify how many entries you want to have in the list. If you reduce the data, entries will be randomly zapped until there are only the number you want.

The procedure is irreversible. That is, nothing gets altered on disk, but if you change your mind you will have to re-compute or else go back to an earlier saved version.
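The idea of reducing a list randomly to N entries can be sketched in Python as follows. This is an assumption about the general technique (a simple random sample that keeps the surviving entries in their original order), not a description of WordSmith's own routine; the function name reduce_to_n and the seed parameter are hypothetical.

```python
import random

def reduce_to_n(entries, n, seed=None):
    """Randomly drop entries until only n remain, keeping the original order."""
    if n >= len(entries):
        return list(entries)
    rng = random.Random(seed)  # a seed makes the sample repeatable
    keep = set(rng.sample(range(len(entries)), n))
    return [entry for index, entry in enumerate(entries) if index in keep]

# Hypothetical concordance of 100 lines, reduced to a sample of 10.
entries = [f"line {i}" for i in range(1, 101)]
sample = reduce_to_n(entries, 10, seed=1)
```

Because the function returns a new list, the original entries survive here; in WordSmith the reduction can only be undone by re-computing or reloading a saved version.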

See also: zapping, editing a list of data.

5.16.2 reverse-delete

This function simply reverses any deletions in your entries. That is, any entries which were not previously marked as deleted get deleted, and any which were marked as deleted get restored.

See also: Delete or Restore to End.

5.16.3 delete or restore to end

These menu options let you mark all entries from the current selected entry downwards as deleted, or alternatively restore to the end.

See also: Reverse-delete, Reduce to N entries.

5.16.4 delete if

The idea is to be able to delete entries with a search. The search operates on the column of data which you have currently selected, so first ensure you click the data in the desired column.

The syntax is as in Concord, so you may need to use asterisks.

If you are searching a concordance line, the search will operate on the whole of the line that Concord knows about, not just the few words you can see.

5.16.5 editing column headings

By default, a word-list will have column headings like these:
[Screenshot: default word-list column headings]

If you choose View | Layout, you get to see the various headings:

[Screenshot: layout window listing the headings]
and if you double-click any of these you may edit it to change the column header as in this (absurd) example:
[Screenshot: word column renamed]

If you now save your word-list, the new column heading gets saved along with the data. Other new word-lists, though, will have the default WordSmith headings.

If you want all future word-lists to have the same headings, you should press the Save button in the layout window. (If you had been silly enough to call the word column "Ulan Bator" and to have saved this for all subsequent word-lists, you could remedy the problem by deleting Documents\wsmith6\wordlist list customised.dat.)

5.16.6 editing a list of data

With a word list on screen, you might see something like this.

[Screenshot: word list with status bar]
In the status bar at the bottom, the number in the first cell is the number of words in the current word list and AA in the third cell is the word selected. At the moment, when the user types anything, WordList will try to find what is typed in the list.

If you right-click the second cell you will see and can change the options for this list to Set (to classify your words, e.g. as adjectives v. nouns) or Edit, to alter them. Note that some of the data is calculated using other data and therefore cannot be edited. For example, frequency percentage data is based on a word's frequency and the total number of running words. You can edit the word frequency but not the word frequency percentage.

Choose Edit. Now, in the column which you want to edit, press any letter. This will show the toolbar (if it wasn't visible before) so you can alter the form of the word or its frequency. If you spell the word so that it matches another existing word in the list, the list will be altered to reflect your changes.

In this case we want to correct AACUTE, which should be Á.

If you now type Á, you will immediately see the result in the window:
[Screenshot: edited entry]

Clicking the downward arrow at the right of the edit combobox, you will see that the original word is there just in case you decide to retain it.

After editing you may want to re-sort and, if you have changed a word such as AAAAAGH to a pre-existing word such as AAGH, to join the two entries.

See also: joining entries, finding source files.

5.17 find relevant files

The point of it...
Suppose you have identified muscle, fibre, protein as key words in a specific text. You might want to find out whether there are any more texts in your corpus which use these words.

How to do it
This function can be reached in any window of data which contains the menu option, e.g. a word-list or a key words list.

It enables you to seek out all text files which contain mention of at least one of the words you have marked or selected. Before you click, choose the set of texts which you want to peruse. (If you haven't, the function will let you use the text(s) the current key words or word-list entries are based on.)

[Screenshot: keywords list with two items marked]
Here we have a keywords list based on a Chinese folk tale with two items chosen by marking. The text files to examine in this case are all the Shakespeare tragedies...

What you get
A display based on all the words you marked, showing which text files they were found in and how many of each word were found.

If you double-click as shown

you'll get to see the source text and can examine each of the words, in this case the four tokens of the type dream.
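The kind of result described under "What you get" can be approximated with a short Python sketch: for each text, count how often each marked word occurs, and report only the texts that contain at least one of them. The texts are held in memory as a dictionary for simplicity, and the function name, the sample texts and the letters-only tokenisation are all assumptions for illustration rather than WordSmith's actual behaviour.

```python
import re
from collections import Counter

def find_relevant_files(texts, marked_words):
    """Map each text name to counts of the marked words found in it.

    Texts that contain none of the marked words are left out.
    """
    wanted = {word.lower() for word in marked_words}
    results = {}
    for name, content in texts.items():
        # Crude tokenisation: runs of letters, compared case-insensitively.
        tokens = re.findall(r"[^\W\d_]+", content.lower())
        counts = Counter(token for token in tokens if token in wanted)
        if counts:
            results[name] = dict(counts)
    return results

# Hypothetical corpus of three short "files".
texts = {
    "a.txt": "The muscle fibre is made of protein. Muscle tissue contracts.",
    "b.txt": "A dream within a dream.",
    "c.txt": "Nothing relevant here.",
}
hits = find_relevant_files(texts, ["muscle", "fibre", "protein"])
```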

5.18 folder settings

These are found in the main Controller. The settings folder will by default be a sub-folder of your My Documents folder but it can be set elsewhere if preferred.

5.19 fonts

Found by choosing Settings | Font in all Tools or via Language Settings in the Controller.

Enables you to choose a preferred Windows font and point size for the display windows and printing in all the WordSmith Tools suite. Note that each language can have its own different default font.

If you have data visible in any Tool, the font will automatically change; if you don't want any specific windows of data to change, because you want different font sizes or different character sets in different windows, minimise these first.

To set a column of data to bold, italics, underline etc., use the layout option.

WordSmith Tools will offer fonts to suit the language chosen in the top left box. Each language may require a special set of fonts.

Language choice settings once saved can be seen (and altered, with care) in Documents\wsmith6\language_choices.ini.

5.20 main settings

Found in Main settings in the WordSmith Tools Controller.

Startup
Restore last work will bring back the last word-list, concordance or key-words list when you start WordSmith.
Show Help file will call up the Help file automatically when you start WordSmith.
Associate/clear file extensions will teach Windows to use (or not to use) Concord, WordList, KeyWords etc. to open the relevant files made by WordSmith.

Check for updates

WordSmith can be set to check for updated versions weekly, monthly or not at all. You may freely update your version within the version purchased (e.g. 6.0 allows you to update to any 6.x version until 7.0 is issued).

Toolbar & Status bar
Each Tool has a status bar at the bottom and a toolbar with buttons at the top. By default the toolbar is hidden to reduce screen clutter.

System
The first box lets you force the boxes which appear for choosing a file to show the files in various different ways. For example "details" will show listings with column headers, so with one click you can order them by date and pick the most recent one even if you cannot remember the exact filename.

The Associate/clear file extensions button will teach Windows to use (or not to use) Concord, WordList, KeyWords etc. to open the relevant files made by WordSmith.

5.21 language

The point of it...
1. Different languages sometimes require specific fonts.
2. Languages vary considerably in their preferences regarding sorting order. Spanish, for example, uses this order: A,B,C,CH,D,E,F,G,H,I,J,K,L,LL,M,N,Ñ,O,P,Q,R,S,T,U,V,W,X,Y,Z. Accented characters are by default treated as equivalent to their unaccented counterparts in some languages (so, in French we get donne, donné, données, donner, donnez, etc.) but in other languages accented characters are not considered to be related to the unaccented form in this way (in Czech we get cesta .. čas .. hře .. chodník ..).

Sorting is handled using Microsoft routines. If you process texts in a language which Microsoft haven't got right, you should still see word-lists in a consistent order.

Note that case-sensitive means that Mother will come after mother (not before apple or after zebra).

It is important to understand that a comparison of two word-lists (e.g. in KeyWords) relies on sort order to get satisfactory results. You will get strange results if you are comparing 2 word-lists which have been declared to be in different languages.

Settings

Choose the language for the text you're analysing in the Controller under Language Settings. The language and character set must be compatible, e.g. English is compatible with Windows Western (1252), DOS Multilingual (850).

WordSmith Tools handles a good range of languages, ranging from Albanian to Zulu. Chinese, Japanese, Arabic etc. are handled in Unicode. You can view word lists, concordances, etc. in different languages at the same time.

Font
Text Format
Characters within word, Hyphens separate words, Numbers

How more languages are added
Press the Edit Languages button.

See also: Choosing Accents & Symbols, Accented characters, Processing text in Chinese etc., Text Format, Changing language.

5.21.1 Overview

You will probably only need to do this once, when you first use WordSmith Tools.

How to get here
The Language Chooser is accessed from the main WordSmith Controller menu: Settings | Main Settings | Text and Languages | Other Languages.

What you will see may look like this:

[Screenshot: Language Chooser with 9 languages chosen]
9 languages have been chosen already. At the bottom you will see what the fonts on your system are for the current language selected.

See also: Language, Font, Sort Order, saving your choices, Other Languages, Changing the language of saved data.

5.21.2 Language Chooser

How to get here
The Language Chooser is accessed from the main WordSmith Controller:

What it does
The list of languages on the left shows all those which are supported by the PC you're using. If any of them are greyed, that's because although they are "supported" by your version of Windows, they haven't been installed in your copy of Windows. (To install more multilingual support, you will need your original Windows CD-ROM or may be able to find help on the Internet.)

On the right, there are the currently chosen languages for use with WordSmith. The default

language should be marked #1 and others which you might wish to use with *. To change the status of a chosen language, right-click. This user is about to make Persian the #1 default.

To delete any unwanted language, right-click and choose "demote". To add a language, drag it from the left window to the right, then set the country and font you prefer for that particular language.

For each chosen language, you can specify any symbols which can be included within a word, e.g. the apostrophe in English, where it makes more sense to think of "don't" as one word than as "don" and "t". You can also specify whether a hyphen separates words or not (e.g. whether "self-conscious" is to be considered as 2 words or 1).

Each time you change language, the list of fonts available changes.

Some languages do not mark word-separators.

See also: Other Languages, saving your choices.

5.21.3 Other Languages

To work on a language not in the list, go to New Language, type in a suitable name and base your new language on one of the existing languages. Choose a font which can show the characters & symbols you want to include. Sort order is handled as for the language you base your new language on.

See also: Language, Font, Sort Order, saving your choices.

5.21.4 Font

The Fonts window shows those available for each language, depending on fonts you have installed.

You will need a font which can show the characters you need: there are plenty of specialised fonts to be found on the Internet. Unicode fonts can show a huge number of different characters, but require your text to be saved in Unicode format.

If you change font, the list of characters available changes. See the Unicode section for more on Unicode.

See also: Language, Sort Order, saving your choices, Other Languages.

5.21.5 Sort Order

Sorting is done in accordance with the language chosen. (Spanish, Danish, etc. sort differently from English.)

The display

· You will see 2 windows below "Resort". The one at the left contains some words in various languages; you can add your own. If your keyboard won't let you type them in, paste from your own collection of texts.
· The one at the right shows how these words get sorted according to the language you have selected.

See also: Language, Font, saving your choices, Other Languages.

5.21.6 saving your choices

Save your results before quitting, so that next time WordSmith Tools will know your preferences regarding fonts, your #1 default language and your subsidiary default languages, and you won't need to run this again. Results will be in Documents\wsmith6\language_choices.ini.

See also: Language, Font, Sort Order, Other Languages.

5.22 layout & format

With any list open, right-click or choose View | Layout to choose your preferred display formats for each column of data.

Layout or Add data?
The Layout tab gives you a chance to format the layout of your data. Add a column of data lets you compute a new variable.

You can edit the headings by double-clicking and typing in your own preferred heading. "Frequency in the text" is too long but serves to illustrate.

Move
Click on the arrows to move a column up or down so as to display it in an alternative order.

Alignment
Allows a choice of left-aligned, centred, right-aligned, and decimal aligned text in each column, as appropriate.

Typeface
Normal, bold, italic and/or underlined text. If none are checked, the typeface will be normal.

Screen Width
In your preferred units (cm. or inches).

Here 3 of the headings have been activated (by clicking) so that settings can be changed so as to get them all the same width.

Case
lower case, UPPER CASE, Title Case or source: as it came originally in the text file. The default for most data is upper case.

Decimals
The number of decimal places for numerical data, where applicable.

For example, suppose you have this list of the key words of Midsummer Night's Dream in view
[Screenshot: key words of Midsummer Night's Dream]
but want to show the numbers in the column above 0.02, corresponding to WALL, FAIRY etc.: select the column(s) you want to affect, and set the decimals e.g. like this

where the top number is the decimal places (2, unchanged from the default for percentage data) and the bottom is the threshold below which the data are not shown. In this case, any data smaller than 0.0001 won't be shown (the space will be blank). As soon as you make the change, you should immediately see the result.

Visibility
Show or hide, or show only if greater than a certain number. (If this shows ***, then this option is not applicable to the data in the currently selected column.)

Colours
The bottom left window shows the available colours for the foreground & background. Click on a colour to change the display for the currently selected column of information.

Restore
Restores settings to the state they were in before. Offers a chance to delete any custom saved layout for the current type of data (see Save).

Save
The point of this Save option is to set all future lists of the same type as the one you're working on to a preferred layout. Suppose you have a concordance open. If you change the layout as you like and save the concordance in the usual way it will remember your settings anyway. But the next time you make a concordance, you'll get the WordSmith default layout. If you choose this Save, the next time you make a concordance, it will look like the current one. A custom saved layout will be found in your Documents\wsmith6 folder, e.g. Concordance list customised.dat. (The only way of removing such settings would be to rename or delete that file.)

Alternatively you can choose always to show or hide certain columns of data with settings. For example, in the Controller's Concord settings the What you see tab offers these options,

which can be saved permanently.

Freeze the columns
If you have a lot of detailed consistency files and wish to freeze the word column so as to see the words for every column of numbers, choose View | Freeze columns... This allows you to set the number of fixed columns, for example to 2, and the display will look as it does here:
[Screenshot: detailed consistency list with two frozen columns]
where the N and Word columns are both frozen (and cannot be re-sorted), allowing you to look at the 679th and 680th text file data.

Similarly a statistics list allows the text file-names column to be frozen:

See also: setting & saving defaults, setting colour choices in WordSmith Tools Controller.

5.23 match words in list

The point of it...
This function helps you filter your listing. You may choose to relate the entries in a concordance or list of words (word-list, collocate list, etc.) with a set of specific words which interest you. For example, to mark all those words in your list which are function words, or all those which end in -ing. Those which match are marked with a tilde (~). With the entries marked, you can then choose to delete all the marked entries (or all the unmarked ones), or sort them according to whether they're marked or not.

How to do it: WordList example
With a word-list loaded up using WordList, click in the column whose data you want to match up. This will usually be one showing words, not numbers. Then choose Compute | Matches. If you have no suitable match-list settings, you may get this:
[Screenshot: message about match-list settings]

The main Controller settings dialogue box appears.

The circled areas show some of the main choices: make sure you are choosing for the right Tool, and if matching words from a text file, browse to find it and then press Load to load its words. You must of course decide what is to be done with any matching entries.

Text File or Template
Choose now whether you want to filter by using a text file which contains all the words you're interested in (e.g. a plain text file of function words [not supplied]) or a template filter such as *ing (which checks every entry to see whether it contains a word ending in ing).

To use a match list in a file, you first prepare a file, using Notepad or any plain text word processor, which specifies all the words you wish to match up. Separate each word using commas, or else place each one on a new line. You can use capital letters or lower-case as you prefer. You can use a semi-colon for comment lines. There is no limit to the number of words.

Example
    ; Match list for test purposes.
    THE,THIS,IS
    IT
    WILL
    *ING

If you choose a file, the Controller will then read it and inform you as to how many words there are in

it. (There is no limit to the number of words but only the first 50 will be shown in the Controller.)

Action
The current Tool then checks every entry in the selected column in your current list to see whether it matches either the template or one of the words in your plain text file. Those which do match are marked or deleted as appropriate for the Action requested (as in the example below where five matching entries were found, the action selected was delete entries which match and the match list included THE, IS and IT).

I answered No so you could see this result:

In the screenshot below, the action was find matches & mark them, and the match-list contained archaic forms like thou, thee, thy.

The marking can be removed using a menu option or by re-running the match-list function with remove match marking as the action.

You can obtain statistics of the matches, using the Summary Statistics menu option.

See also: Comparing Word-lists, Comparing Versions, Lemma Matching, Stop Lists

5.24 previous lists

These windows show the lists of results you have obtained in previous uses of WordSmith. To see any of these, simply select it and double-click: the appropriate Tool will be called up and the data shown in it.

The popup menu for the window is accessed by a right-click on your mouse. To delete an entry, select it and then press Del. To re-sort your entries click the header or choose Resort in the popup menu.

5.25 print and print preview

Print settings are in the main Controller:

Print Settings

If you set printing to monochrome, your printer will use italics or bold type for any columns which use a colour other than the current "plain text" colour. Otherwise it will print in colour on a colour printer, or in shades of grey if the printer can do grey shading.

You can also change the units, adjust orientation (portrait or landscape) and margins, and set the default header and footer.

When you choose a print or print preview menu item in a Tool, you'll be taken by default to a print preview, which shows you what the current page of data looks like, and from which you can print.

Bigger and Smaller

Zoom to 100% or fit to page, or choose a view in the list. The display here works in exactly the same way as the printing to paper. Any slight differences between what you see and what you get are due to font differences. You can also pull the whole print preview window larger or smaller.

Next & Last Page

Takes you forward or back a page.

Portrait or Landscape?

Sets printing to the page shape you want.

Header, Footer, Margins

You can type a header & footer to appear on each page. Press Show if you want them included. If you include the date code, this will put in today's date, and the page code does the numbering. Margins are altered by clicking the numbers: you will see the effect in the print preview space at the right.

Print

This calls up the standard Windows printer page and by default sets it to print the current page. You can choose other pages in this standard dialogue box if you want.

Some columns of data not shown

A case like this, showing nothing but the line numbers, arises because you have pulled the concordance data too wide for the paper. WordSmith prints only those columns of data which are going to fit. Shrink the column, hide any unwanted ones, or else set the print surface to landscape.

See also: Printer Settings

5.26 quit WordSmith

Alt+X is the hot key.

Closing WordSmith Tools Controller will close down all of the Tools. If you press Alt+X, or use the System menu Close commands, you will get a chance to save any unsaved sets of data before the Tool in question closes. You will be asked to confirm closure if any window of data is still open. If you're in a hurry, use the "no-check Exit" menu option which by-passes these checks.

By default, the last word list, concordance or key words listing that you saved or retrieved will be automatically restored on entry to WordSmith Tools. This feature can be turned off temporarily via a menu option or permanently in Documents\wsmith6\wordsmith6.ini.

5.27 saving

5.27.1 save results

Save

To save your corrected results use Save (Ctrl+F2) in the menu. This saves all the results so you can return to the data at a later date. You may wish to clean up any deleted items by zapping, first.

Saved data is in a special WordSmith Tools format. The only point of it is to make it possible to use the data again another day. You will not be able to examine it usefully outside the Tools. If you want to export your data to a spreadsheet, graphics program, database or word processor, etc., you can do this either by saving as text or by copying the data to the clipboard.

save part of the data only

By default, saving stores all your data that you haven't zapped. If you want to save only part of it, but don't want to zap it to oblivion, choose Copy.

avoid the save prompts?

You can avoid them in the Advanced settings (Main Settings | Advanced Settings | Advanced | Other).

5.27.2 save as text

The point of it...

Save as Text means save your data as a plain text file (as opposed to the WordSmith format for retrieving the data another day). It is usually quicker to copy selected text into the clipboard, e.g. if you simply want to insert your results into your word processor.

If you want to copy the data in colour, you should definitely use the clipboard. In the case of a concordance, if you want only the words visible in your concordance line (not the number of characters mentioned below), use the clipboard and then Paste or Paste Special in graphics format.

How to do it

This function can be reached by Save As... | Plain text, XML text, Excel spreadsheet, or Print to File (via F3), or Copy to text file.

Options include:

    header            words you want to save at the start of the data (leave blank if not wanted)
    numbers           whether the numbers visible in the column numbered at the left are saved too
    column separator  by default a tab, but you can specify something else to go between visible columns
    rows              all / any which you have highlighted / a specific range, e.g. 1-10, 5-, -3
    columns           all / any which you have highlighted / a specific range (column 1 is the one with the numbers)

You can then easily retrieve the data in your spreadsheet, database, word-processor, etc. (If you want to use it as a table in a word processor, first save as text, then in your word-processor choose the Convert Text to Table option if available. Choose to separate text at tabs.)

Note: The Excel spreadsheet save will look something like this:

The words are visible from row 18 onwards; above them we get some summary data. The 1/8, 2/8 etc. section splits the data into eighths; thus 100% of the Texts data (column E) is in the 8th section, whereas nearly all the data (98.8%) are in the smallest section in terms of word frequency, because so many words come once only.

You'll be asked whether to compute this summary data if you choose to save as Excel.

In the case of a concordance line, saving as text will save as many "characters in 'save as text'" as you have set (adjustable in the Controller Concord Settings). The reason for this is that you will probably want a fixed number of characters, so that when using a non-proportional font the search-words line up nicely. See also: Concord save and print.

Each worksheet can only handle up to 65,000 rows and 256 columns. If necessary there will be continuation sheets.

If your data contains a plot you will also get another worksheet in the Excel file, looking like this.

The plot data are divided into the number of segments set for the ruler (here they are eighths), and the percentage of each gets put into the appropriate column. That is, cell B3 means that 23.7% of the cep.txt data come in the first eighth of the text file. Set the format correctly as percentages in Excel, and you will see something like this:

At the top you get the raw data, which you can use in Excel to create a graphic.
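The division of plot data into ruler segments can be illustrated with a short Python sketch. It assumes hit positions are known as word offsets into a text of known length; the function name and the sample offsets are hypothetical, and WordSmith's own computation may differ in detail.

```python
def segment_percentages(positions, text_length, segments=8):
    """Percentage of hits falling in each equal-sized segment of a text.

    positions   -- hit offsets counted from the start of the text
    text_length -- total length of the text in the same units
    segments    -- number of ruler segments (8 gives eighths)
    """
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    counts = [0] * segments
    for pos in positions:
        # Clamp so a hit at the very end still lands in the last segment.
        index = min(pos * segments // text_length, segments - 1)
        counts[index] += 1
    total = len(positions)
    return [100.0 * c / total if total else 0.0 for c in counts]

hits = [10, 50, 120, 480, 900, 990]  # hypothetical word offsets
shares = segment_percentages(hits, text_length=1000)
print([round(s, 1) for s in shares])
```

Each value corresponds to one column of the plot worksheet described above: the share of the hits that falls in that eighth of the text.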

If you want access to the details of the plot, choose save as text. The results will look like this:

and you can then process those numbers in another program of your choice.

In the case of XML text, you get a little .HTM file and a large .XML file. Click on the .HTM file and you can see your data a page at a time, with buttons to jump forwards or back a page, as well as to the first and last pages of data. This accesses your .XML file to read the data itself.

See also: Excel Files in batch processing
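The row ranges accepted by the save-as-text options earlier in this section (1-10, 5-, -3) could be interpreted as in the Python sketch below. The reading of the open-ended forms is an assumption: 5- is taken to mean "row 5 to the end" and -3 to mean "up to row 3". The function names are hypothetical.

```python
def parse_row_range(spec, total_rows):
    """Turn a range such as '1-10', '5-' or '-3' into 1-based inclusive bounds."""
    start_text, dash, end_text = spec.strip().partition("-")
    if not dash:
        # A single row number such as "7".
        first = last = int(start_text)
    else:
        first = int(start_text) if start_text.strip() else 1
        last = int(end_text) if end_text.strip() else total_rows
    first, last = max(first, 1), min(last, total_rows)
    if first > last:
        raise ValueError(f"empty row range: {spec!r}")
    return first, last

def select_rows(rows, spec):
    """Return the rows covered by a range specification."""
    first, last = parse_row_range(spec, len(rows))
    return rows[first - 1:last]

rows = [f"row{n}" for n in range(1, 13)]  # twelve hypothetical rows
print(select_rows(rows, "-3"))
print(select_rows(rows, "5-")[0], "...", select_rows(rows, "5-")[-1])
```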

5.28 scripting

Scripts

This option allows you to run a pre-prepared script. In the case below, sample_script.txt has requested two concordance operations, a word list, and a keywords analysis. The whole process happened without any intervention from the user, using the defaults in operation.

The syntax is as suggested in the EXAMPLES visible above. (There is a sample_script.txt file in your Documents\wsmith6 folder.) First the tool required, then the necessary parameters, each surrounded by double quotes, in any order.

    concord corpus="x:\text\dickens\hard_times.txt" node="hard" output="c:\temp\hard.cnc"

made a concordance of the hard_times.txt text file looking for the search-word hard and saved results in c:\temp\hard.cnc.

    concord corpus="x:\text\dickens\hard_times.txt" node="c:\temp\sws.txt" output="c:\temp\outputs.txt" 1_at_a_time="true"

made a concordance of the same text file looking for each search-word in the sws.txt file, counted the number of hits and saved results in c:\temp\outputs.txt.

    wordlist corpus="x:\text\shakes\oll\txt\tragedies\*.txt" output="c:\temp\shakespeare.lst"

made a word list of all the .txt text files in a folder of Shakespeare tragedies (not including sub-folders) and saved it.

    keywords refcorpus="j:\temp\BNC.lst" wordlist="c:\temp\shakespeare.lst" output="j:\temp\shakespeare.kws"

made a key words list of that word list compared with a BNC word list and saved it.

Two additional optional parameters not visible there are: TXT_format="true" and 1_at_a_time="true".

If TXT_format is true, a Concord file will contain only the concordance lines, a KeyWords file only the key words and their frequencies, and a WordList file only the words and their frequencies.

If 1_at_a_time is true, a word-list will export separate results text file by text file.

If 1_at_a_time is true, Concord will read search words from a text file and save summary results:

    concord corpus="x:\text\dickens\hard_times.txt" node="c:\temp\sws.txt" output="c:\temp\outputs.txt" 1_at_a_time="true"

produced this in c:\temp\outputs.txt:

    x:\text\dickens\hard_times.txt
    hard      50
    soft      3
    mean      54
    empty     9
    fred      0
    book      13
    north*    4
    south*    2

collocate scripts

It is also possible to run a script requesting the collocates of each word in a word-list. This syntax

    wordlist collocates of "c:\temp\shakespeare.lst" output="c:\temp\shakespeare\collocates"

tells WordSmith to compute the collocates of each word in the shakespeare.lst word-list, and save results as plain text files, one per word, in the c:\temp\shakespeare\collocates folder. The texts to be processed are the same text files used when the word list was created (and must still be present on disk to work, of course).

Settings affecting the process are shown below. The first 6 have to do with the words from the word-list, and the min. in collocate-list refers to how many collocates of each word-list word are needed (here 10) for processing to be reported. Min. total column refers to the number in the total column of a collocation display.
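If you prepare many scripts, the lines can be generated rather than typed. The Python sketch below builds lines in the shape shown in the examples in this section (tool name first, then name="value" parameters). It is a hedged illustration only: build_script_line is a hypothetical helper, and putting one operation per line is an assumption based on those examples.

```python
def build_script_line(tool, params):
    """Build one script line: the tool name followed by name="value" pairs.

    Values are wrapped in double quotes, so a value that itself contains
    a double quote is rejected rather than guessed at.
    """
    parts = [tool]
    for name, value in params.items():
        if '"' in value:
            raise ValueError(f"value for {name} must not contain a double quote")
        parts.append(f'{name}="{value}"')
    return " ".join(parts)

lines = [
    build_script_line("concord", {
        "corpus": r"x:\text\dickens\hard_times.txt",
        "node": "hard",
        "output": r"c:\temp\hard.cnc",
    }),
    build_script_line("wordlist", {
        "corpus": r"x:\text\shakes\oll\txt\tragedies\*.txt",
        "output": r"c:\temp\shakespeare.lst",
    }),
]
script_text = "\n".join(lines) + "\n"
print(script_text, end="")
```

The resulting text could be saved as a plain text file and run as a script in the way described above.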

Results look like this:

Here they're incomplete because I pressed the Stop button.

Each of these lists has the collocates output much as in a collocates display, but with the relationships also computed.

The process only saves results where the settings shown above are met and where the relationships also meet the requirements as in the WSConcgram settings.

See also: drag and drop

5.29 searching

5.29.1 search for word or part of word

All lists allow you to search for a word or part of one, or a number. The search operates on the current column of data, though you can change the choice as in this screenshot, where Concordance is selected. The syntax is as in Concord.

In the case of a concordance line, the search operates over the whole context so far saved or retrieved. So although kept wondering is visible in the context, the search has found the phrase state schools tested (highlighted to show you where) about 80 words before the search word wondering.

To search again, press OK again...

Whole word – or bung in an asterisk

The syntax is as in Concord, so by default a whole word search. To search for a suffix or prefix, use the asterisk. Thus *ed will find any entry ending in ed; un* will find any entry starting with un. *book* will find any entry with book in it (book, textbook, booked).

See also: Searching by Typing, Search & Replace.

5.29.2 search by typing

Whenever a column of display is organised alphabetically, you can quickly find a word by typing. As you type, WordSmith will get nearer. If you've typed in the first five letters and WordSmith has found a match, there'll be a beep, and the edit window will close. You should be able to see the word you want by now.

See also: Edit v. Type-in mode, Searching for a word or part of one, Search & Replace, Editing, WordList sorting

5.29.3 search & replace

Some lists, such as lists of filenames, allow for searching and replacing.

The point of it

If your text data has been moved from one PC to another, or one drive to another, it will be necessary to edit all the filenames if WordSmith ever needs to get at the source texts, such as when computing a concordance from a word list.

Search & Replace for filenames

If you are replacing a filename you will see something like this.

We distinguish between the path and the file's individual name, so that for a case like C:\texts\BNC\spoken\s conv\KC2.txt the filename is KC2.txt and the path to it is C:\texts\BNC\spoken\s conv.

To correct the path to the file, e.g. if you've moved your BNC texts to drive Q:\my_moved_texts, you might simply replace as shown here, and all the filenames which contain C:\texts will get Q:\my_moved_texts: e.g. c:\texts\BNC\spoken\s conv\KC2.txt will become Q:\my_moved_texts\BNC\spoken\s conv\KC2.txt.

To rename a filename only, change the radio buttons in the middle of the window and the search and replace operation will ignore the path but replace within the filename only.

Search & Replace for other data

In this case the search & replace isn't of filenames but, in the case below in Viewer and Text Aligner, of the actual text. Like a search operation, the search operates on the current column of data.

The context line shows what has been found. The line below shows what will happen if you agree to the change.

    Yes       make 1 change (the highlighted one), then search for the next one
    Skip      leave this one unchanged, search for the next one
    Yes All   change without any check
    Skip All  stop searching...

Whole word – or bung in an asterisk

The syntax is as in Concord, so by default a whole word search. To search for a suffix or prefix, use the asterisk. Thus *ed will find any entry ending in ed; un* will find any entry starting with un. *book* will find any entry with book in it (book, textbook, booked).

Word lists can be sorted by suffix: see WordList sorting.

See also: Searching by Typing, Searching with F12, Accented Characters & Symbols.

5.30 selecting or marking entries

5.30.1 selecting multiple entries

It can be necessary to select non-adjacent entries e.g. for Joining (lemmatisation).

How to do it

To select more than one entry in a word-list, concordance, key word list etc., hold down Control first, and click in the number column at the left edge.

To select various entries in this detailed consistency list, I held down the Ctrl key and clicked at the numbers 8, 9, 10 and 16.

Alternatively, to mark entries you can choose Edit | Mark (F5) in the menu.

5.30.2 marking entries

Non-adjacent entries can be marked by clicking the word and pressing Alt+F5. The first one marked will get a green mark in the margin and subsequent ones will get white marks.

After marking these, I added scholarship(s) by selecting two more words then pressing Alt+F5 again.

To undo a specific entry, press Alt+F5 again. To un-mark all entries, use Shift+Alt+F5. To lemmatise all marked entries, press F4 after marking.

5.31 filenames tab

This tab shows the text file name(s) from which your current data comes. You can edit these names if necessary (e.g. if the text files have been moved or renamed). To do so, choose Replace.

Afterwards, if you save the results, the information will be permanently recorded.

See also: finding source files.

5.32 settings

5.32.1 save defaults

Settings can be altered by choosing Colour Settings in the WordSmith Tools Controller.

Any setting menu item in any Tool gives you access to these:

General, Folders, Colours, Languages, Tags, Lists, Concord, KeyWords, WordList, Index, Advanced, WSConcgram

These tabs allow you to choose settings which affect one or more of the Tools.

    colours     customise the default colours
    folders     set WordSmith so it "knows" which folders you usually use
    languages   character set, treatment of hyphens & numbers
    general     default file extension, printing, restore last file
    tags        tags to ignore, tag file, tag file autoloading, custom tagsets
    stop lists  for Concord, KeyWords and WordList
    matching    files to match up, or lemma files to mark lemmas in a word list, etc.
    Concord     number of entries, sort system, collocation horizons
    KeyWords    procedure, max. p value, database & associate minimum frequencies, reference corpus filename
    WordList    word length & frequencies, type/token #, cluster settings
    Index       making a word-list index
    Advanced    advanced settings
    WSConcgram  for the concgram utility

permanent settings and wordsmith6.ini file

You can save your settings with a button at the top of the Controller, or by editing the wordsmith6.ini file, installed when you installed WordSmith Tools. This specifies all the settings which you regularly use for all the suite of programs, such as your text and results folders, screen colours, fonts, the default columns to be shown in a concordance, etc.

You can restore the defaults.

show help file

In the general tab of Main Settings you will see a checkbox called "show help file". If checked, this will always show this help file every time WordSmith starts up. The point of this is for users who only use the software occasionally, e.g. in a site licence installation.

sayings

Using Notepad, you can edit Documents\wsmith6\sayings.txt, which holds sayings that appear in the main Controller window, if you don't like the sayings or want to add some more.

site licence and CD-ROM defaults

If you're running WordSmith straight from a CD-ROM, your defaults cannot be saved on it as it's read-only; Windows will find a suitable place for wordsmith6.ini, usually off the root folder of My Documents.

The first time you use WordSmith, you will be prompted to choose appropriate Folders, Text Characteristics, Tag details etc. and Save All Settings for future use. You can change settings and save them as often as you like.

Similarly, on a network you will usually not be allowed to change defaults permanently, as this would affect other users. Your network administrator should have installed the program so that you have your own copy of wordsmith6.ini, where it may be both read and altered. If WordSmith Tools finds a copy of wordsmith6.ini in that folder it will be able to use your personal preferences.

5.32.2 restoring settings

How to find it

Settings can be restored to default settings by choosing Main settings | Advanced Settings | Advanced | Restore.

The point of it...

You may have changed settings and cannot recall how to undo them...

Factory Defaults

Restores your wordsmith6.ini file to factory condition. Re-starts WordSmith with all relevant boxes filled in accordingly.

Warning Messages

Removes any of the messages you have received and where you've ticked the "never show this again" box.

Customised Layouts

Removes any of the layouts you have saved for the various types of data WordSmith shows.

Colours

Re-sets colours to the factory defaults.

5.33 source text(s)

The point of it...

The aim is to be able to see the whole text file your data came from, with some relevant words highlighted.

How to do it

The Concord and KeyWords tools both have areas which can show the source texts which your data was produced from, visible by choosing the source texts tab, if your texts are still where they were when the data analysis was done. (If they have been moved you can try editing the Filenames data to correct this.) In Concord, you need to double-click the relevant concordance line to get the source text to show.

In each case the relevant key or search words will be highlighted if possible.

In KeyWords you'll see the source text in the source texts tab space or, if there are various source texts, listed in a special window (shown below).

Menu options (right-click to see these)

Copy, Print, Save

As their names suggest these menu items let you copy, save or print any text you've selected or the whole text. For saving you will get a chance to decide whether to save as plain text or as Rich Text Format (.RTF), preserving font and colour information.

Next, Previous

These jump you through the text one highlighted word at a time. You should see how many highlighted words there are in the status bar.

Grey markup, Clear markup, Restore markup

Grey mark up lets you grey out all < > sections.

Clear mark up simply cuts the tags out.

Once you have cut out markup, the clear mark up option changes to restore mark up (needed if Concord is to jump to the correct location).

Greying out mark-up is quite slow if the text is extensive. This shot shows its progress:

Double-clicking the status bar gives you a chance to stop the process.

KeyWords List of Source Texts

If you right-click this window, you get a chance to see which texts contain which key words by clicking Frequencies, giving results like this:

and if you double-click a highlighted word (THINK in the example above), you will be shown the source text (A01.txt) with that word highlighted. If you simply click the file-name you get to see the text with all its key words highlighted.

5.34 stop lists

Stop lists are lists of words which you don't want to include in analysis. For example you might want to make a word list or analyse key words excluding common function words like the, of, was, is, it.

To use stop lists, you first prepare a file, using Notepad or any plain text word processor, which specifies all the words you wish to ignore. Separate each word using commas, or else place each one on a new line. You can use capital letters or lower-case as you prefer. You can use a semi-colon for comment lines. There is no limit to the number of words. Stop lists do not use wildcards (match-lists may).

There is a file called stop_wl.stp (in your \wsmith6 folder) which you could use as a basis and save under a new name. You'll also find basic_English_stoplist.stp there, based on top frequency items in the BNC. Or just make your own in Notepad and save it with .stp as the file-extension. If that is difficult, rename the .txt as .stp.

Example

    ; My stop list for test purposes.
    THE,THIS,IS
    IT
    WILL

Then select Stop List in the menu to specify the stop list(s) you wish to use. Separate stop lists can be used for the WordList, Concord and KeyWords programs. If the stop list is activated, it is in effect: that is, the words in it will be stopped from being included in a word list. If you wish always to use the same stop list(s) you can specify them in wordsmith6.ini as defaults.

To choose your stop list, click the small yellow button in the screenshot, find the stop list file, then press Load. You will see how many entries were correctly found and be shown the first few of them.

With a stop list thus loaded, start a new word list. The words in your stop list should now not appear in the word list.

continuous

Normally, every word is read in while making the word list and stored in the computer's memory without checking whether it's in the stop list. Eventually each word in the set is checked against your stop list and omitted if it is present. That is much quicker. However, it means that for the most part, any statistics are computed on the whole text, disregarding your stop list.

If you choose continuous, the processing will slow down dramatically, since as every word is read in while making the word list it will be checked against the stop list and ignored if found. In other words, every single case of THE and OF and IS etc. will be looked at as the texts are read in and sought in your stop list. The effect will be to give you detailed statistics which ignore the words in the stop lists.

subtract wordlengths in statistics

If you have not chosen continuous processing as explained above, you may want the statistics of your word list to attempt to deal in part with the stop list work done. With this choice, after the word list is computed, all the statistics concerning the number of types and tokens and 3-letter, 4-letter words etc. will be adjusted for the overall column (but not for the column for each single text) in your statistics.

See Match List for a more detailed explanation, with screenshots.

Another method of making a stop list file is to use WordList on a large corpus of text, setting a high minimum frequency if you want only the high-frequency words. Then save it as a text file. Next, use the Text Converter to format it, using stoplist.cod as the Conversion file.

stop lists in Concord

In the case of Concord a stop list can do two jobs: first, it will cut the stop list words out as collocates. Additionally it can cut out any stop list words as search-word hits: for example if you concordance beaut* and beautiful is in your stop list, any concordance lines containing beautiful will get cut out (those containing beauty will remain). For this to be activated, make sure you check the search-word box in the settings.

Stop lists ... are accessed via an Advanced Settings button in the Controller
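The difference between the normal procedure and continuous processing for word lists can be sketched in Python as follows. This is a simplified illustration, not WordSmith's implementation: the function names are hypothetical, words are assumed to be already split into tokens, and only the token total stands in for "the statistics".

```python
from collections import Counter

def load_stop_list(text):
    """Parse stop-list text: commas or new lines separate words,
    ';' starts a comment line, and case is ignored."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):
            words.update(p.strip().upper() for p in line.split(",") if p.strip())
    return words

def word_list_default(tokens, stop_words):
    """Count every token first, then drop stop words at the end.
    The token total is therefore computed on the whole text."""
    counts = Counter(t.upper() for t in tokens)
    total_tokens = sum(counts.values())
    for word in stop_words:
        counts.pop(word, None)
    return counts, total_tokens

def word_list_continuous(tokens, stop_words):
    """Check each token against the stop list as it is read in.
    Slower in principle, but the statistics ignore stop words."""
    counts = Counter()
    for token in tokens:
        token = token.upper()
        if token not in stop_words:
            counts[token] += 1
    return counts, sum(counts.values())

stop_words = load_stop_list("; My stop list for test purposes.\nTHE,THIS,IS\nIT\nWILL\n")
tokens = "the cat is on the mat".split()
default_counts, default_total = word_list_default(tokens, stop_words)
cont_counts, cont_total = word_list_continuous(tokens, stop_words)
print(default_total, cont_total)
```

Both approaches yield the same list of words; only the totals differ, which is the point made above about statistics disregarding or respecting the stop list.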

See also: Making a Tag File, Lemmatisation, Match List.

5.35 suspend processing

As WordSmith works its way through text files, or re-sorting data, you will see a progress window in the Controller with horizontal bars showing progress. If appropriate there'll be a Suspend button, too.

Pressing this offers 4 choices:

carry on

... as if you had not interrupted anything

stop after this file

Finishing the file means that you can keep track of what has been done and what there wasn't time for. (How? By examining the filenames in the word list, concordance or whatever you have just been creating.)

stop as soon as possible

...useful if you're ploughing through massive CD-ROM files. WordSmith will stop processing the current file in the middle, but will retain any data it has got so far.

panic stop

The whole Tool (Concord or WordList, or whatever) will close down and some system resources or memory may be wasted. The Controller will not be closed down.

Press Suspend again to effect your choice.

5.36 text and languages

These settings affect how WordSmith will handle your texts. At the top, you see boxes allowing you to choose the language family (e.g. English) and sub-type (UK, Australia etc.). These choices are determined by the preferences you have previously set. That is, the expectation is that you only work with a few preferred languages, and you can set these preferences once and then forget about them. You do this by pressing the Edit Languages button.

The choices below may differ for each language:

hyphens and numbers

You can also specify whether hyphens are to count as word separators. If the hyphen box is checked [X], self-access will be treated as two words.

Should numbers be included in a word-list as if they were ordinary words? If you leave this checkbox blank, words like $300, 50.3M or 10th will be ignored in word lists, key words, concordances etc. and replaced by a #. If you switch it on, they will be included.

SI numbers: the International System of Units (SI) stipulates that "Spaces should be used as a thousands separator (1 000 000) in contrast to commas or periods (1,000,000 or 1.000.000) to reduce confusion resulting from the variation between these forms in different countries." So numbers like 1,234,567.89 would be written 1 234 567.89. If you wish WordSmith to recognise such forms as one number each, leave this box checked, otherwise such a form in text would be counted as three successive numbers (1, 234, and 567.89).

characters within word

WordSmith automatically includes as valid alphabetical symbols all those determined by the operating system as alphabetical for the language chosen. So, for English, A to Z and common accents such as é. For Arabic or Japanese, whatever characters Microsoft have determined count as alphabetic. But you may wish to allow certain additional characters within a word. For example, in English, the apostrophe in father's is best included as a valid character as it will allow processing to deal with the whole word instead of cutting it off short. (If you change language to French you might not want apostrophes to be counted as acceptable mid-word characters.)

Examples:

    '     (only apostrophes allowed in the middle of a word)
    '%    (both apostrophes and percent symbols allowed in the middle of a word)
    '_    (both apostrophes and underscore characters allowed in the middle of a word)

You can include up to 10.
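The effect of the hyphen, number and mid-word character settings described above can be imitated with a small Python tokeniser. It is a rough sketch under stated assumptions: the function and parameter names are hypothetical, "alphabetic" is approximated by Python's Unicode letter classes rather than the operating system's tables, and SI-style spaced numbers are not handled.

```python
import re

def tokenize(text, hyphen_separates=True, mid_word_chars="'", include_numbers=False):
    """Split text into words, loosely imitating the settings above.

    hyphen_separates -- if True, 'self-access' counts as two words
    mid_word_chars   -- extra characters allowed in the middle of a word
    include_numbers  -- if False, number-like tokens are replaced by '#'
    """
    extra = mid_word_chars + ("" if hyphen_separates else "-")
    letters = r"[^\W\d_]+"  # one or more Unicode letters
    if extra:
        word = rf"{letters}(?:[{re.escape(extra)}]{letters})*"
    else:
        word = letters
    # Digits, optional decimal/thousands groups, optional letter suffix (10th, 50.3M).
    number = r"\d+(?:[.,]\d+)*[^\W\d_]*"
    tokens = []
    for match in re.finditer(rf"{number}|{word}", text):
        token = match.group()
        if token[0].isdigit() and not include_numbers:
            token = "#"
        tokens.append(token)
    return tokens

sample = "My father's self-access cost $300"
print(tokenize(sample))
print(tokenize(sample, hyphen_separates=False, include_numbers=True))
```

With the defaults, the apostrophe keeps father's whole while the hyphen splits self-access; switching the options changes both behaviours, as the settings above describe.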
If you want to allow fathers' too, check the allow to end of word box. If this is checked, any of these symbols will be allowed at either end of a word as long as the character isn't all by itself (as in " ' ").

Plain Text/HTML/SGML

Your texts may be Plain Text in format: the default. If they are tagged in HTML, SGML or XML you should choose one of the options here. That way, the Tools can make optimum use of sentence, paragraph and heading mark-up.

(start & end of) headings

For the Tools to count headings, they need to know how to recognise the start and end of one. If your text is tagged, type the opening and closing heading tags in here. (# stands for any digit, ## for two, etc.) Whatever you type is case sensitive: an upper-case tag is not the same as its lower-case equivalent. (If you have HTML text which is not consistent, sometimes using one form and sometimes the other, then use Text Converter to make your texts consistent.)

(start & end of) sections

If these boxes contain a pair of opening and closing tags, the Tools will use them to identify sections. Again, whatever you type is case sensitive.

(start & end of) sentences

If this space contains the word auto, the Tools will treat sentences as defined (ending with a full stop, question mark or exclamation mark, and followed by a capital letter), but if your text is tagged with sentence tags, type those in here. Again, whatever you type is case sensitive.

(start & end of) paragraphs

For the Tools to recognise paragraphs, they need to know what constitutes a paragraph start and/or end, e.g. a sequence of two line-breaks (where the original author pressed Enter twice) or a line-break followed by some other marker. For that you would type the corresponding codes. If your text is tagged with a paragraph tag, you can type the tag in here. Case sensitive, too.

In many cases you may consider that defining a paragraph end will suffice (considering everything up to it to be part of the preceding one). Much HTML text does not consistently distinguish between paragraph starts and ends.

Note that spoken texts in the BNC use a different tag instead of the paragraph tag, but you can leave the paragraph tag here, as WordSmith will use the alternative instead if the text has no paragraph tags in it.

See also: Tagged Text, Choosing a new language, Stop Lists, Processing text in Chinese etc.

5.37 text dates and time-lines

The point of it ...

The idea is to be able to treat your text files diachronically -- that is, studying change through time.

You might want a concordance, for example, to be ordered by the text date. Or you might be interested in knowing when a certain word first appeared in your corpus and whether it gained popularity in succeeding years. Or whether the collocates of a word like web changed before 1990 and after.

This screenshot shows a time-line based on concordancing energy/emissions/carbon in about 30 million words of UK newspaper text dealing with climate change, 2000-2010.

The first line shows overall data where all results on three search-terms are merged. Concordance hits are represented as a graph with green lines and little red blobs for each time period. The grey rectangles and the grey graph line both represent the same background information, namely the amount of word-data searched. The difference is merely that the grey line is twice as high as the rectangles below it.
The number of hits in each year is mostly roughly proportional to the amount of text being examined, though in 2006 and 2009 for the term emissions it seems that the hit rate was slightly higher. In the first half of the decade carbon was rather under-mentioned in proportion to the amount of climate-change data being studied.
See also choosing text files: setting file dates

5.38 window management
The main WordSmith Tools Controller will be at the top left corner of your screen, half the screen width and half the screen height in size. Other Tools will appear in the middle. Each Tool main window will come just below any previous ones. Make use of the Taskbar (or Alt-tab, which helps you to switch easily from one window to the next).

"Start another Concord window"?
You will see this if you already have a window of data and press New to start another concordance. You can have any number of windows open for each Tool, each with different data.

minimising, moving and resizing windows
All windows can be stretched or shrunk by putting the mouse cursor at one edge and pulling. They can be moved most easily by grabbing the top bar, where the caption is, and pulling, using the mouse. You can minimise a window: it becomes an icon which you restore by clicking on it. If you maximise it, it will fill the entire screen of the Tool concerned. These are standard Windows functions. It's okay to minimise the main Controller window when using individual Tools.

tile and cascade
You can Tile or Cascade the Tools from the main WordSmith Tools program.

restore last file
A convenience feature: the last file you saved or retrieved will by default be restored when you re-enter WordSmith Tools. I've kept it to one only to avoid screen clutter! This feature can be turned off temporarily via a settings option or permanently in wordsmith6.ini (in your Documents\wsmith6 folder). You can also generally access your last saved result in any Tool by right-clicking and choosing last file:

5.39 word clouds
The point of it...
Many of the lists in WordSmith offer a word cloud feature similar to those you have probably seen on the web. The idea is to promote pattern-noticing.

How to get here
This function is accessed from the Compute menu, sub menu-item Word Cloud in the various Tools.

Examples
With Concord you can get a word cloud based on any column of data. In this case, of clusters based on collocates of cash, this example was computed:
In the case of key words, you can get something like this:

In this case the word cloud was based on the key words of a novel, Bleak House (Charles Dickens). The highlighted word Guppy is the name of one of the characters and details of this word are shown to the right.

What you can see and do
The Copy and Print buttons do what their names suggest. The Refresh button recalculates the cloud, e.g. after you have deleted items in the original data. As your mouse hovers over a word in the cloud you get details of that individual word. You can change the word cloud settings in the main Colours setting in the Controller.
The font sizes range from a minimum of 8 to a maximum of 40 depending on the range of values in your data. The font is the one you may choose for any of your standard displays.

5.40 zap unwanted lines
To restore the correct order to your data after editing it a lot or marking lines for deletion, press the Zap button (or Ctrl+Z). This will permanently cut out all lines of data which you have deleted (by pressing Del) unless you've restored them from deletion (Ins).
In the case of a word list, it will also re-order the whole file in correct frequency order. Any deleted entries are lost at this stage. Any which have been assigned as lemmas of head words may still be viewed, before or after saving. However, after zapping, lemmas can no longer be undone.
In the case of a concordance, you may wish the list of filenames to be re-computed to reflect only the files still referred to in your concordance. To do that, choose Compute | Filenames.
See also: reduce data to N entries.

Section VI
Tags and Markup

6 Tags and Markup
6.1 overview
What is markup for?
Marked up text is text which has extra information built into it with tags. You may wish to concordance words or tags...
You may wish to see this additional information or ignore it, so that you just see the plain text ("We like spaghetti."). WordSmith has been designed so that you can choose what to ignore and what to see.
You may want to translate HTML or SGML tags or entity references: if your text has &Eacute; you probably want to see É.
You may wish to select within text files, e.g. cutting out a header or getting only the conclusions, instead of using the whole text.
And you might want to get WordSmith to choose only files meeting certain criteria, e.g. having "sex=f" in a text file header section, where the speaker is a woman.
You can see the effect of choosing tags if you select the Choose Texts option, then press the View button. Any retained tags will be visible, and ignored tags replaced by spaces.

Tags and Markup Settings
... are accessed via an Advanced Settings button in the Controller.

See also: Guide to handling the BNC, Handling Tags, Making a Tag File, Showing Nearest Tags in Concord, Tag Concordancing, Types of Tag, Using Tags as Text Selectors, Viewing the Tags, Tags in WordList, Concord Sound and Video, XML text.
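What ignoring tags does to a text can be pictured with a small Python sketch. It is only an illustration, not WordSmith's implementation: the function name ignore_tags is hypothetical, the default search span of 200 characters follows the figure given for mark-up to ignore in the section on handling tags, and replacing each tag by a single space is an assumption.

```python
import re

def ignore_tags(text, open_sym="<", close_sym=">", search_span=200):
    """Replace each tag by a space (hypothetical sketch).

    A tag is taken to be open_sym, then at most search_span characters
    that are not close_sym, then close_sym. Anything longer is left alone.
    """
    pattern = (
        re.escape(open_sym)
        + "[^" + re.escape(close_sym) + "]{0," + str(search_span) + "}"
        + re.escape(close_sym)
    )
    return re.sub(pattern, " ", text)

tagged = "<s>We like <b>spaghetti</b>.</s>"
print(ignore_tags(tagged).split())   # ['We', 'like', 'spaghetti', '.']
```

The search span is what stops an isolated < in, say, a mathematical expression from swallowing a long stretch of text when no closing > follows nearby.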

6.2 choices in handling tags

ignore all tags
Specify all the opening and closing symbols in Main Settings | Advanced | Tags | Mark-up to ignore and such tags will be simply left out of word lists and concordances, as if they weren't in the original text files.
example: <*>
This will cut out all wording starting at each < symbol and ending at the next > symbol (up to 200 characters apart). (You can put more than one pair of brackets, e.g. <*>;[*] if you like.)

ignore some tags and retain others
If you want to ignore some but retain others, you will need to prepare a tag file which lists all those you want to keep. You get WordSmith Tools to read this text file in by choosing the Tag File menu option under Settings. Such tags will then be incorporated into your word lists, concordances, etc. as if they were ordinary words or suffixes.
example: supposing you've set <*> as "tags to ignore", but listed <title>, <body> and <conclusion> as tags to retain in your tag file, WordSmith will keep any instances of <title>, <body> or <conclusion> in your data but will ignore <introduction>, <Ulan Bator>, <threat>, etc.
Tags to retain will only be active if there's a file name visible and you have pressed the Load or Clear button. If you press Load, you will see which tags have been read in from the tag file.

translate entity references into other characters
If you use XML, SGML or HTML tagged text, you may want to translate symbols. For example, SGML, XML and HTML use an entity reference instead of a long dash. To do this, first prepare a Tag File which contains the strings you want to translate. Then choose Main Settings | Advanced | Tags & Mark up | Entity File (entities to be translated) and choose your entity file. WordSmith will then translate any entity references in this file into the corresponding characters.

to load up these tag files automatically
See Custom Settings.

See also: Guide to handling the BNC, Overview of Tags, Making a Tag File, Showing Nearest Tags in Concord, Tag Concordancing, Types of Tag, Viewing the Tags, Using Tags as Text Selectors, Tags in WordList.
</p> <p><span class="badge badge-info text-white mr-2">148</span>
6.3 custom settings
Custom Tagsets
In the main Settings | Tags & Mark up window, you may see custom settings choices like this.

The point of it...
The point of this choice is to change a whole series of settings according to the type of corpus you wish to process. When you change the setting above, any valid data as explained below will get loaded into your defaults.

How to do it
Press the Edit button to create or edit the custom settings (the file is called custom_tag_settings.xml and it'll get saved in your Documents\wsmith6 folder).
</p> <p><span class="badge badge-info text-white mr-2">149</span>
To start a new set, press Add and give a suitable name (such as Shakespeare for processing the Shakespeare corpus). Fill in the boxes and press Save.
· All boxes will have leading and trailing spaces removed.
· Use auto for automatic processing e.g. of sentence ends.
· Checking the default box means that this set gets chosen by default and any tag or entity files will get automatically loaded for you.
See also: Tags as text selectors

6.4 tags as selectors
Defaults
The defaults are: select all sections of all texts selected in Choose Texts but cut out all angle-bracketed tags.

Custom settings
</p> <p><span class="badge badge-info text-white mr-2">150</span>
There are various alternatives in this box which help your choices with the boxes below.
Choosing British National Corpus World Edition (as in the screenshot) will for example automatically put </teiHeader> into the Document header ends box below. You can also edit the options and their effects.

Markup to ignore
If you want to cut out unwanted tags e.g. in HTML files, leave something like < > or [ ] or < >; [ ] in Mark up to ignore. The "search-span" means how far WordSmith should look for a closing symbol such as > after it finds a starting symbol such as <. (The reason is that these symbols might also be used in mathematics.)

Markup to INclude or EXclude
</p> <p><span class="badge badge-info text-white mr-2">151</span>
See Making a Tag File.

Entity file
See Making a Tag File.

Text Files and Mark-up
You can get WordSmith to use tags to select one section of a text and ignore the rest. This is "selecting within texts". You can also select between texts: that is, get WordSmith to look within the start of each text to see whether it meets certain criteria.
These functions are available from Main Settings | Advanced | Tags | Only If Containing or Only Part of File.

Document Header
When you process a set of texts usually containing a standard header (e.g. a copyright notice) you may wish to remove it automatically. Ensure that some suitable tag is specified as above in the </teiHeader> example. (If you choose Custom Settings above, you will get suitable choices automatically.)
The process cuts by looking for the Document header ends mark-up and deleting all text to that point. (If you have a header repeated in the same text file, WordSmith will need to be told what mark-up is used for Document header starts too, and you will need to choose Only Part of File to get such headers removed.)
For more complex searches, you might want to choose the Only If Containing or Only Part of File buttons visible above.

The order in which these choices are handled
If you choose to select either between or within texts, WordSmith will check that each text file meets your requirements, before doing your concordance, word list, etc. It will
1. Select between files to check whether each contains the words you've specified;
2. Cut out any section specified as a "section to cut";
3. If there are "sections to keep", cut out everything which is not within them;
</p> <p><span class="badge badge-info text-white mr-2">152</span>
4. Cut start of each line, if applicable;
5. Process any entity references you want to translate;
6. Ignore any tags not to be retained (see the "Mark-up to ignore" section of the screenshot above).

See also: Overview of Tags, Tag Handling, Making a Tag File, Tag Concordancing, Showing Nearest Tags in Concord, Types of Tag, Viewing the Tags, Guide to handling the BNC, XML text.

6.5 only if containing...
The point of it
You might want to process only the speech of elderly men, or only advertising material, or only classroom dialogues. This function allows WordSmith to search through each text, e.g. in text headers, ensuring that you get the right text files and skip any irrelevant ones.
Suppose you have a large collection of texts (e.g. the British National Corpus) and you cannot remember which of them contain English spoken by elderly men. Knowing that the BNC uses <stext> for spoken texts, sex=m for males and age=5 for speakers aged 60 or more, you can get WordSmith to filter your text selection. It will search through the whole of every text file (not just the tags or header sections; in fact the first 2 megabytes of the file) to check that it meets your requirements.
You can specify up to 15 tags, each up to 80 characters in length. They will be case sensitive (i.e. you will get nothing if you type Age=5 by mistake).
Horizontally, the options represent combinations linked by "or". Vertically, the combinations are "and" links. The bottom set represents "but not" combinations.
After your text files have been processed, you will be able to see which met your requirements in the Text File choose window and can save the list for later use as favourites.

Examples:
</p> <p><span class="badge badge-info text-white mr-2">153</span>
You only want text files which contain either roses or violets or seeds, and flowers must be present too, and so must garden and spade. But you do not want lime juice to be present in the text.
If you want book or hotel but only if they're not in a text file containing publish or Booker Prize: write book into the first box, hotel in the box beside it, and publish* and Booker * in the first two boxes in the bottom row.

See also: Tags as Selectors, Selecting within texts, Extracting text sections, Filtering your text files using Text Converter, Guide to handling the BNC.

6.6 part of file: selecting within texts
The point of it
The aim is to let you get WordSmith to process only specific parts of your text files, getting rid of chunks you're not interested in.

Cut out or Keep?
Press the Cut or Keep tab to choose to cut out certain sections, and/or only to use certain sections.
</p> <p><span class="badge badge-info text-white mr-2">154</span>
Sections to Cut
Note: if you only want to remove a document header such as </header>, it is easier to do that in the general tag settings, section Document Header.
For more complex choices, you may here specify what is to be cut, where it starts (for example <introduction>) and where you want to cut to (e.g. </introduction>). You can choose to cut out up to 7 different and separate sections (<HEAD> to </HEAD> or <BODY> to </BODY>). This function is case-sensitive and cuts out any section located as many times as it is found within the whole text.

Cut start of each line/paragraph
The point of this is that some corpora (e.g. LOB) have a fixed number of line-detail codings at the start of each line. Here you want to cut them out (that is, after every <Enter>). Choose the number of characters to cut, up to 100; the default is 0. Use -1 if you want to cut everything up to the first alphabetical character at the start of each line, and -2 to cut everything up to the first tab.

Sections to Keep (contexts)
</p> <p><span class="badge badge-info text-white mr-2">155</span>
You want to select just one or two sections of the text and cut out the rest. Specify one tag to define the desired start, and one to specify the end, e.g. <Intro> to <Body> (these would analyse only text introductions), or <Mary> to </Mary> (these would get all of Mary's contributions in the discourse but nothing else).
Here we have chosen to use 2 different sections, <Peter> to </Peter> to get the sections spoken by Peter and <Hong Kong> to </Hong Kong> to get the sections marked up as referring to Hong Kong as well.
Naturally you must be sure that there is something unique like a < or > symbol to define each section. This function is case sensitive (so it would not find <PETER>).
If you used <H1> to </H1> with this function in HTML text you'd get all the major headings in your texts, however many, but nothing else.
The "off" switch doesn't have to look like the "on" switch: you could keep, for example, <INTRO> to </BODY> and thereby cut out the conclusion if that comes after the </BODY>.

Ignore text files not containing choices
If this is checked, your text files will be examined to ensure they contain the mark-up for sections to keep (here <Peter> and <Hong Kong>).

OK
Once you've pressed OK, you will see that WordSmith knows you want only certain parts of each file because the Only Part of File button goes red (as will the Only if Containing button if there were
</p> <p><span class="badge badge-info text-white mr-2">156</span>
sections to keep and the Ignore text files not containing choices box was checked).

See also: Tags as Selectors, Only if containing <x>, Guide to handling the BNC.

6.7 making a tag file
Tag Syntax
Each tag is case sensitive.
Tags conventionally begin with < and end with > but the first & last characters of the tag can be any symbol.
You can use * to mean any sequence of characters; ? to mean any one character; # to mean any numerical digit.
Don't use [ to insert comments in a tag file, since [ is useful as a potential tag symbol.
You can use # to represent a number (e.g. <h#> will pick up <h5>, <h1>, etc.). And use ? to represent any single character (<?> will pick up <s>, <p>, etc.), or * to represent any number of characters (e.g. <u*> will pick up <u who=Fred>, <u who=Mariana>, etc.). Otherwise, prepare your tag list file in the same way as for Stop Lists.
Use notepad or any other plain text editor, to create a new .tag file. Write one entry on each line.
Any number of pre-defined tags can be stored. But the more you use, the more work WordSmith has to do, of course, and it will take time & memory ...

Mark-up to EXclude
</p> <p><span class="badge badge-info text-white mr-2">157</span>
A tag file for stretches of mark-up like this
<SCENE>A public library in London. A bald-headed man is sitting reading the News of the World.</SCENE>
where you want to exclude the whole stretch above from your concordance or word list, e.g. because you're processing a play and want only the actors' words.
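The wildcard syntax described under Tag Syntax above (*, ? and #) can be made concrete with a short Python sketch that turns a tag-file entry into a regular expression. This is an illustration only, not how WordSmith itself matches tags: the function name is hypothetical, and treating * as "zero or more characters" is an assumption.

```python
import re

def tag_pattern_to_regex(tag):
    """Translate a tag-file entry into a compiled, case-sensitive regex.

    * -> any sequence of characters (assumed: possibly empty)
    ? -> exactly one character
    # -> exactly one numerical digit
    Everything else is matched literally.
    """
    parts = []
    for ch in tag:
        if ch == "*":
            parts.append(".*?")
        elif ch == "?":
            parts.append(".")
        elif ch == "#":
            parts.append(r"\d")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

heading = tag_pattern_to_regex("<h#>")
print(bool(heading.fullmatch("<h5>")))   # True
print(bool(heading.fullmatch("<hx>")))   # False
print(bool(heading.fullmatch("<H5>")))   # False: tags are case sensitive
```

Using fullmatch means the whole tag has to fit the pattern, so <?> picks up <s> or <p> but not <ss>.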
Mark-up to exclude will cut out the whole string from the opening to the closing tag inclusive.
For the Shakespeare corpus, a set of tags to EXclude might be used. (The idea is not to process any stage directions when processing the Shakespeare corpus.)
The syntax requires ></ or >*</ to be present. Legal syntax examples would be:
<SCENE></SCENE>
<SCENE>*</SCENE>
<SCENE #>*</SCENE>
<HELLO?? #>*</GOODBYE>
(In this last example it'll cut only if <HELLO is followed by 2 characters, a space and a number then >, and if </GOODBYE> is found beyond that.)
<SCENE>*
</SCENE>
won't work, because both parts of the tag must be on the same line.
<SCENE>*<\SCENE>
won't work, because the slash must be /.
With your installation you will find (Documents\wsmith6\sample_lemma_exclude_tag.tag) included, which cuts out lemmas if constructed on the pattern <lemma tag="*>*</lemma>, i.e. with the word tag, an equals sign and a double-quote symbol, regardless of what is in the double-quotes.

Mark-up to INclude
A tag file for tags to retain contains a simple list of all the tags you want to retain.
</p> <p><span class="badge badge-info text-white mr-2">158</span>
Sample tag list files for BNC handling (e.g. bnc world.tag) are included with your installation (in your Documents\wsmith6 folder): you could make a new tag file by reading one of them in, altering it, and saving it under a new name.

colour
Tags will by default be displayed in a standard tag colour (default=grey) but you can specify the foreground & background for tags which you want to be displayed differently by putting /colour="foreground on background" e.g.
<noun> /colour="yellow on red"
Available colours: 'Black', 'White', 'Cream', 'Red', 'Maroon', 'Yellow', 'Navy', 'Blue', 'Light Blue', 'Sky Blue', 'Green', 'Olive', 'Dollar Green', 'Grey-Green', 'Lime', 'Purple', 'Light Purple', 'Grey', 'Silver', 'Light Grey', 'Dark Grey', 'Medium Grey'.
The colour names are not case sensitive (though the tags are). Note UK spelling of "grey" and "colour".

Also, you can put "/play media" if you wish a given tag, when found in your text files, to be able to attempt to play a sound or video file. For example, with a tag like
<sound *> /colour="blue on yellow" /play media
and a text occurrence like
<sound c:\windows\Beethoven's 5th Symphony.wav>
or
<sound http://www.political_speeches.com/Mao_Ze_Dung.mp3>
you will be able to choose to hear the .wav or .mp3 file.

Finally, you can put in a descriptive label, using /description "label" like this:
<w NN*> /description "noun" /colour="Cream on Purple"
<ABSTRACT> /description "section"
<INTRODUCTION> /description "section"
<SECTION 1> /description "section"

Tagstring_only tags
You can also define two tags as ones you want to use to mark the beginnings and ends of what will be shown in a concordance, using /tagstring_only as a signal. For example, if concordancing text containing titles marked out with an opening and a closing title tag, you may want to see only the title text. You'd include both tags in the tag file, each followed by /tagstring_only, e.g.
<title> /tagstring_only
To get Concord to show only the text between these two, choose View | Tag string only in Concord's menu.

Section tag
In the examples using "section", Concord's "Nearest Tag" will find the section however remote in the text file it may be.
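To make the layout of a tag-file line easier to see, here is a minimal Python sketch that splits one line into the tag and its optional /colour, /description, /play media and /tagstring_only settings. It is not WordSmith's parser: the function name and the returned dictionary are hypothetical, and the sketch assumes angle-bracketed tags that end at the first > on the line.

```python
import re

def parse_tag_line(line):
    """Split one tag-file line into the tag and its options (sketch only)."""
    line = line.strip()
    end = line.find(">")
    if not line.startswith("<") or end == -1:
        raise ValueError("expected a line starting with an angle-bracketed tag")
    tag, rest = line[: end + 1], line[end + 1:]
    colour = re.search(r'/colour="([^"]+) on ([^"]+)"', rest)
    description = re.search(r'/description\s+"([^"]*)"', rest)
    return {
        "tag": tag,  # kept exactly as typed, since tags are case sensitive
        # colour names are not case sensitive, so they are normalised here
        "foreground": colour.group(1).lower() if colour else None,
        "background": colour.group(2).lower() if colour else None,
        "description": description.group(1) if description else None,
        "play_media": "/play media" in rest,
        "tagstring_only": "/tagstring_only" in rest,
    }

entry = parse_tag_line('<w NN*> /description "noun" /colour="Cream on Purple"')
print(entry["tag"], entry["foreground"], entry["background"], entry["description"])
# <w NN*> cream purple noun
```

A line with no options at all simply yields the tag with every setting empty or False, which corresponds to the default grey display.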

This is particularly useful e.g. if you want to identify the speech of all characters in a play, and have a list of the characters, and they are marked up appropriately in the text file.
/description "section"
/description "section"
/description "section"
Here is an example of what you see after selecting a tag file and pressing "Load". The first tag is a "play media" tag, as is shown by the icon. You can see the cream on purple colour for nouns too. The tag file (BNC World.tag) is included in your installation.

Entity File (entities to be translated)
If you load it you might see something like this:

A tag file for translation of one entity reference into another uses the following syntax: entity reference to be found + space + replacement.
Examples:
&Eacute; É
&eacute; é
In the screenshot above, the sample tag file for translation (Documents\wsmith6\sgmltrns.tag) which is included with your installation has been loaded. You could make a new one by reading it in, altering it, and saving it under a new name.

See also: Overview of Tags, Handling Tags, Showing Nearest Tags in Concord, Tag Concordancing, Types of Tag, Viewing the Tags, Using Tags as Text Selectors, Guide to handling the BNC.

6.8 tag-types
You will need to specify how each tag type starts and ends, and you should be consistent in usage. Restrict yourself to symbols which otherwise do not appear in your texts.

eight special markers
Eight kinds of marker may be marked as significant for word lists: those which represent starts and ends of headings, sections, sentences and paragraphs. Type these in the appropriate spaces when selecting Text Characteristics.
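The entity-file syntax described in section 6.7 (entity reference to be found, then a space, then the replacement) can be sketched in a few lines of Python. The function names are hypothetical and the two entries are standard HTML entity names chosen for illustration; they are not copied from sgmltrns.tag, and the sketch makes no claim about how WordSmith applies the table internally.

```python
def load_entity_table(lines):
    """Read 'entity<space>replacement' lines into a dictionary (sketch)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entity, _, replacement = line.partition(" ")
        table[entity] = replacement
    return table

def translate_entities(text, table):
    """Replace every listed entity reference by its replacement string."""
    for entity, replacement in table.items():
        text = text.replace(entity, replacement)
    return text

table = load_entity_table(["&Eacute; É", "&eacute; é"])
print(translate_entities("&eacute;t&eacute;", table))   # été
```

Because the lookup is a plain string replacement, it is case sensitive: &Eacute; and &eacute; are separate entries giving É and é respectively.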

tags within 2 separators
These tags are often used to signal the part of speech of each word; they're also widely used in HTML, XML and SGML for "switches", e.g. one tag to switch on Heading 1 style and another to switch it off again. You should use the same opening and closing symbols, usually some kind of brackets, for all your tags (as the British National Corpus does using SGML or XML markup).

entity references
HTML, XML and SGML use so-called entity references for symbols which are outside the standard alphabet, e.g. &eacute;t&eacute; which represents été.

Specify these two types of markup by choosing Settings/Tag Lists, or Settings/Text Characteristics/Tags. You will then see a dialogue box offering Text to Ignore and a Browse button.
The Tags to Ignore option allows you to specify tags which you do not want to see in the concordance or word list results.
The Tags to be INcluded option allows you to specify a tag file, containing tags which you do want to see in the concordance or word list results.
The Tags to be EXcluded option allows you to specify a different tag file, containing stretches of tags which you want to find and remove in the concordance or word list results.
The Tags to be Translated option allows you to specify entity references which you want to convert on the fly, such as the one for é.

multimedia markers
Text files can be tagged for reference to sound or video files which you can hear or see. For example, a text might contain a sound-file tag between stretches of ordinary text such as blah blah blah ... blah blah etc. A concordance on blah blah could pick up the tag so you can hear the source mp3 file. See defining multimedia tags.

See also: Overview of Tags, Handling Tags, Making a Tag File, Showing Nearest Tags in Concord, Tag Concordancing, Using Tags as Text Selectors, Viewing the Tags, Concord Sound and Video, Guide to handling the BNC.

(A particular sub-variety of tags within 2 separators sometimes used is tags with underscores at the left and space at the right like this
He_PRONOUN entered_VERB the_DET room_NOUN.
To process these, you will need to declare the underscore a valid character, or else convert your corpus to a format which uses tags within 2 separators.)

6.9 start and end of text segments
WordSmith attempts to recognise 4 types of text segment: sentences, paragraphs, headings, sections.
Processing is case sensitive. You can use strings representing an end of paragraph or a tab in your texts. For sentence ends, auto is another option.

Define these in your language settings.

Sentences
For example, one tag might represent the beginning of a sentence and another the end. If you leave the choice as auto, ends of sentences are determined according to the definition of a sentence, which gives an approximation. (There is no 100% accurate way of handling sentence recognition.)

Paragraphs
For example, one tag might represent the beginning of a paragraph and another the end.

Headings
For example, one tag might represent the beginning and another the end. Note that the British National Corpus marks sentences within headings, e.g. the heading Introduction in text HXL. It seems odd for the one word Introduction to count as a sentence, so WordSmith does not use sentence-tags within headings.

Sections
For example, one tag might represent the beginning and another the end.

Each of these is counted preferably when its closing tag is encountered. If there are no closing tags in the entire text then paragraphs will be counted each time the opening paragraph tag is found.

See also: Overview of Tags, Handling Tags, Showing Nearest Tags in Concord, Tag Concordancing, Types of Tag, Viewing the Tags, Using Tags as Text Selectors, Guide to handling the BNC.

6.10 multimedia tags
In this screenshot you see an example of how to define your multimedia tags. This is accessed from Main Settings | Advanced | Tags | Media Tags.

File Extensions
The file extensions (.wav, .mp3, .avi, etc.) define the file types which your computer can play. Of course this function does require your computer to be able to handle sound or video if it is to work: Windows uses the file extension to know how to play it.
Video files will require the free VLC Media Player to be installed (see http://www.vlcapp.com/).

Filename
The sound or video file-name might be
1. specified in a tag
2. the same name as the text file-name but with another extension such as .wav
3. found in the tag and interpreted using a table you have created previously. To do this, make each line like this:
=c:\my_corpus_sounds\angry_man.wav 560 2
=c:\my_corpus_sounds\happy_little_girl.mp3 980 2
where each line has the tag found in the text file, followed by = then the desired value.
Where a tag in the source text gives the filename straight after an = character, the = character is sufficient to define where the filename begins. What follows may also be a web address. For a text containing tags which mark the filename with $$, you'd put $$ to show the start of filename. For the concordance example, soundfile= is adequate to identify where the filename begins.
The media files folder will be needed (for cases 1 and 2 above) if the sound files are not stored in the same folder as your text files.

How to play it
start-mark, duration-mark (optional)
You can indicate markers for start and duration if necessary. They would be needed if your tag contained such markers, or if the sound file is the same as the text file with a different extension. If so, you'd specify start-mark as start= and duration-mark as play= (because that is how they are marked in your text files). Times are measured in seconds.

duration
You can specify a default duration as in the screenshot: 6 seconds. More may be needed especially if the sound tags are not spaced closely together in the text file. If no start or duration indication is given, the whole sound or video file will be played.
If there are no duration and start position markers, the first number will be interpreted as start position and the second as duration, so a tag in your text file giving c:\mysounds\talk.wav followed by the numbers 15 and 5 means "play c:\mysounds\talk.wav starting 15 seconds from the beginning and play for 5 seconds". If there's only one number, 15, that means "play c:\mysounds\talk.wav starting 15 seconds from the beginning and play for the default number of seconds".

defaults
The defaults are: play .mp3 and .wav files. Once you've completed this, save your defaults for next time.

See also: Sound and Video in Concord, Obtaining Sound and Video files, Overview of Tags, Tag Handling, Making a Tag File, Tag Concordancing, Showing Nearest Tags in Concord, Viewing the Tags, Types of Tag, Guide to handling the BNC.

6.11 modify source texts
The point of it...
This function enables you to modify your original source texts as a result of concordance work you've done. In this way, your work can get saved in the source texts themselves. For example, you might want to save user-defined categories, or search-phrase results where you have decided a phrase is a multi-word unit.
Note: this procedure does alter your source texts. Before each is altered for the very first time, it is backed up (original filename with .original extension) but any change to your source texts or corpora must be done with caution!

User-defined categories
For example, suppose you have marked your concordance lines' Set column like this: where miracle in the first line pre-modifies the noun cure and is marked adjectival, but in the second line it is an ordinary noun, and you wish to save this in your original source text files.

How to do it
Choose Compute | Modify Source Texts, and if you want to save the Set choices, choose OK here:

and the set choices will be marked as in this example: (seen by double-clicking the concordance line to show the source text).

Multi-word unit search phrase
Alternatively, if you choose the search-phrase option, then any search word containing a space will have underscores (or whatever other character you choose above) in it to establish multi-word units. Here, the search word or phrase was Rio de Janeiro, and the result of modifying the source texts was this:

Add Time & Date stamp option
This keeps a log of all your changes, enabling the changes to be undone later.

Initials option
Adds your initials to the changes. Leave empty if not wanted. The tag above means a user whose initials were MS made this change and it was the 3rd change.
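The multi-word unit option described above can be pictured with a minimal Python sketch. This is only an illustration of the idea of joining a search phrase with underscores; it is not WordSmith's own code, the function name join_multiword is hypothetical, and it does not write the time-stamp or initials tag or make the .original backup.

```python
import re

def join_multiword(text: str, phrase: str, joiner: str = "_") -> str:
    """Replace each whole-phrase occurrence with one joined token (hypothetical helper)."""
    words = phrase.split()
    # Allow any run of whitespace between the words of the phrase,
    # and require word boundaries on both sides.
    pattern = r"(?<!\w)" + r"\s+".join(re.escape(w) for w in words) + r"(?!\w)"
    return re.sub(pattern, joiner.join(words), text)

sample = "She flew from Rio de Janeiro to Lisbon."
print(join_multiword(sample, "Rio de Janeiro"))
# prints: She flew from Rio_de_Janeiro to Lisbon.
```

The joiner is a parameter because the manual says another character can be chosen instead of the underscore. The match here is case-sensitive, which is an assumption of the sketch.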

To undo previous changes
If you have used the "time and date stamp" option shown above, you will be able to undo the modifications. The undo window shows your whole log. You can choose all those done on a certain day, or by the person whose initials are visible at the right. Here we see the 4 modifications changing Rio de Janeiro into Rio_de_Janeiro.

See also: user-defined categories

6.12 XML text
What is XML?
XML text has angle-bracketed mark-up which provides additional information. For example the British National Corpus has text in which the words of

I mean, where do eating disorders come from?

each carry mark-up. In that mark-up, one tag signals a sentence; the tag before I signals that the next word is a pronoun (coded PNP) whose head-word is "i"; and the tag before disorders signals that the next word is a plural noun belonging to the head-word "disorder" and that it's a substantive. The tag before a word is a start-tag; hw="disorder" is an attribute of the start-tag, and c5="NN2" is another attribute.

WordSmith's handling of XML
By default, WordSmith simply ignores all the mark-up, so a word list will only get the words in black inserted in it, and a concordance will only see those words (I mean, where do eating disorders come from?).

Searching using Attributes
If you want to search for all instances of NN2 forms (plural nouns), you'd need to type a search-word which combines the tag with asterisks, and answer yes to the question as to whether you're concordancing on tags. You would get results like this:

Hide the mark-up
If you prefer not to see all the mark-up in grey, choose to hide the undefined mark-up

There is a button in the main tool which can show or hide mark-up, too.

Asterisks in your search-word
In the example above, the search-word has one asterisk inside the tag, because each start-tag where NN2 forms are found starts in the same way, and another asterisk after the tag, because the word which follows will be right next to the > in our corpus. For two successive parts of speech, a search-word made of two such tag patterns in sequence looks for any article (the/a/an) followed by any singular count noun. A search where we are allowing NN1 or NN2 and requiring the hw to be player gets results like this:

Another example
Searching Italian .XML containing text like this: and wishing to find all cases of the ARTPRE part of speech, with the search-word specified like this and answering yes to this: we get a considerable concordance with entries like this:

(I have no idea why there are % symbols in the source .XML, by the way.)

See also: Handling the BNC
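The two behaviours described in this section, ignoring mark-up by default and searching on an attribute such as c5="NN2", can be sketched in Python as below. The sample string is a hypothetical reconstruction modelled on the description above: only the PNP and NN2 codes, the hw values "i" and "disorder" and the substantive label come from the manual, and the other codes and attribute values are illustrative assumptions. The code uses Python's standard XML parser and says nothing about how WordSmith itself reads XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical BNC-style sentence; most attribute values are illustrative only.
SAMPLE = (
    '<s n="1">'
    '<w c5="PNP" hw="i" pos="PRON">I </w>'
    '<w c5="VVB" hw="mean" pos="VERB">mean</w>'
    '<c c5="PUN">, </c>'
    '<w c5="AVQ" hw="where" pos="ADV">where </w>'
    '<w c5="VDB" hw="do" pos="VERB">do </w>'
    '<w c5="VVG" hw="eat" pos="VERB">eating </w>'
    '<w c5="NN2" hw="disorder" pos="SUBST">disorders </w>'
    '<w c5="VVB" hw="come" pos="VERB">come </w>'
    '<w c5="PRP" hw="from" pos="PREP">from</w>'
    '<c c5="PUN">?</c>'
    '</s>'
)

def plain_text(xml: str) -> str:
    """Ignore all mark-up and keep only the words and punctuation."""
    return "".join(ET.fromstring(xml).itertext())

def words_with_attribute(xml: str, name: str, value: str) -> list:
    """Return the words whose start-tag carries the given attribute value."""
    root = ET.fromstring(xml)
    return [w.text.strip() for w in root.iter("w") if w.get(name) == value]

print(plain_text(SAMPLE))                         # I mean, where do eating disorders come from?
print(words_with_attribute(SAMPLE, "c5", "NN2"))  # ['disorders']
print(words_with_attribute(SAMPLE, "hw", "i"))    # ['I']
```

The first function corresponds to what a word list sees by default; the second corresponds to a tag-based search on an attribute.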

Section VII
Concord

7 Concord
7.1 purpose
Concord is a program which makes a concordance using plain text or web text files.

To use it you will specify a search word or phrase, which Concord will seek in all the text files you have chosen. It will then present a concordance display, and give you access to information about collocates of the search word, dispersion plots showing where the search word came in each file, cluster analyses showing repeated clusters of words (phrases) etc.

The point of it...
The point of a concordance is to be able to see lots of examples of a word or phrase, in their contexts. You get a much better idea of the use of a word by seeing lots of examples of it, and it's by seeing or hearing new words in context lots of times that you come to grasp the meaning of most of the words in your native language. It's by seeing the contexts that you get a better idea about how to use the new word yourself. A dictionary can tell you the meanings but it's not much good at showing you how to use the word.

Language students can use a concordancer to find out how to use a word or phrase, or to find out which other words belong with a word they want to use. For example, it's through using a concordancer that you could find out that in academic writing, a paper can describe, claim, or show, though it doesn't believe or want (* this paper wants to prove that ...).

Language teachers can use the concordancer to find similar patterns so as to help their students. They can also use Concord to help produce vocabulary exercises, by choosing two or three search-words, blanking them out, then printing.

Researchers can use a concordancer, for example when searching through a database of hospital accident records, to see whether fracture is associated with fall, grease, ladder. Or to examine historical documents to find all the references to land ownership.

Online step-by-step guide showing how

7.2 index
Explanations
What to do if it doesn't do what I want...
What is Concord and what's it for?
Collocation
Collocation Display
Plots

Clusters
Patterns

Settings
Choosing texts
Collocate horizons
Collocate settings
Concordance settings
Context word
Main Controller Concordance Settings
Nearest Tag
Search word or phrase
Tag Concordancing
Tagged Texts
Text settings

Procedures
What you can See and Do
Altering the View
Blanking Out a Concordance
Re-sorting a Concordance
Removing Duplicate lines
Re-sorting Collocates
User-defined categories
Editing Concordances
Merging Concordances
Sound and Video in Concord

see also: WordSmith Main Index

7.3 what is a concordance?
A set of examples of a given word or phrase, showing the context. A concordance of give might look like this:
... could not give me the time ...
... Rosemary, give me another ...
... would not give much for that ...
A concordancer searches through a text or a group of texts and then shows the concordance as output. This can be saved, printed, etc.

7.4 search-word or phrase
7.4.1 search word syntax
By default, Concord does a whole-word non-case-sensitive search.

Basic Examples
search word      finds
book             book or Book or BoOk

book*            book, books, booking, booked
*book            textbook (but not textbooks)
b*               banana, baby, brown etc.
*ed              walked, wanted, picked etc.
bo* in           book in, books in, booking in (but not book into)
book * hotel     book a hotel, book the hotel, book my hotel
bo* in*          book in, books in, booking in, book into
book?            book, books, "book;", "book."
book^            book, books
b^^k             book, back, bank, etc.
==book==         book (but not BOOK or Book)
book/paperback   book or paperback

symbol   meaning                                                          examples
*        disregard the end of the word, disregard a whole word            tele*  *ness  *happi*  book * hotel
?        any single character (including punctuation) will match here     Engl???  ?50.00
#        any sequence of numbers, 0 to 9                                  $#  £#.00
^        any single letter of the alphabet will match here                Fr^nc^
==       case sensitive                                                   ==French==  ==Fr*==
:\       means use a file for lots of search-words                        c:\text\frd.txt
         (see file-based search_words)
/        separates alternative search-words. You can specify              may/can/will
         alternatives within an 80-character overall limit
<>       beginning & end of tags

Advanced Search-word Syntax
If you want to use *, ?, ==, #, ^, :\, >, < or / as a character in your search word, put it in double quotes. Examples:
"*"
Why"?"
and"/"or
":\"
"<"
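One way to get a feel for the symbols in the tables above is to translate them into regular expressions. The Python sketch below is a simplified approximation under stated assumptions, not Concord's actual matcher: the function names are hypothetical, it ignores double-quoted symbols, tags, {CHR()} codes and file-based searches, and it makes no claim about edge cases such as book? or book^ listed in the examples.

```python
import re

def token_to_regex(token: str) -> str:
    """Translate one space-free search token into a regular expression."""
    parts = []
    for ch in token:
        if ch == "*":
            parts.append(r"\w*")        # rest of a word, or a whole word
        elif ch == "?":
            parts.append(r".")          # any single character
        elif ch == "^":
            parts.append(r"[^\W\d_]")   # any single letter
        elif ch == "#":
            parts.append(r"\d+")        # any sequence of digits
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def compile_search(search: str) -> re.Pattern:
    """Build a whole-word pattern; case-insensitive unless wrapped in == ==."""
    case_sensitive = len(search) > 4 and search.startswith("==") and search.endswith("==")
    if case_sensitive:
        search = search[2:-2]
    # "/" separates alternatives; spaces separate the words of a phrase.
    alternatives = [
        r"\s+".join(token_to_regex(tok) for tok in alt.split())
        for alt in search.split("/")
    ]
    pattern = r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)"
    return re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)

print(compile_search("book*").findall("Book booking textbook books"))
# ['Book', 'booking', 'books']
print(compile_search("*book").findall("textbook textbooks book"))
# ['textbook', 'book']
print(compile_search("==book==").findall("book BOOK Book"))
# ['book']
```

The whole-word and non-case-sensitive defaults follow the statement at the start of this section; everything else about the translation is an assumption made for illustration.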

Don't forget that question-marks come at the end of words (in English anyway) so you might need *"?"

If you need to type in a symbol using a code, it can be symbolised like this: {CHR(xxx)} where xxx is the decimal number of the code. Examples: {CHR(13)} is a carriage-return, {CHR(10)} is a line-feed, {CHR(9)} is a tab. To represent the line ending which comes at the end of paragraphs and sometimes at the end of each line, you'd type {CHR(13)}{CHR(10)} which is carriage-return followed immediately by line-feed. {CHR(34)} refers to double inverted commas. {CHR(46)} is a full stop. There is a list of codes at http://www.asciitable.com/
You can also use hex format for numbers, e.g. #x9 for tab, #x22 for double inverted commas.

Tags
You can also specify tags in your search-word if your text is tagged. Examples:

symbol   meaning                                    examples
*        single common noun (BNC World)             book, chair, elephant
*        single common noun (BNC XML edition)       book, chair, elephant
*        singular or plural common noun             book, chairs
t*       any single noun beginning with T or t      table, teacher
* *      two single common nouns in sequence        campaign manager

for XML formats see XML text handling

See also: Tag Concordancing, Context Word, Modify source texts, Ignore punctuation, Wildcards

7.4.2 file-based search-words
The point of it...
To save time typing in complex searches. You may want to do a standard search repeatedly on different sub-corpora. Or as Concord allows an almost unlimited number of entries, you may wish to do a concordance involving many search-words or phrases.
The space for typing in multiple search-words is limited to 80 characters (including / etc.). If your preferred search-words will exceed this limit or you wish to use a standardised search, you can prepare a file containing all the search-words.

How to do it...

A sample (Documents\wsmith6\concordance_search_words.txt) is included with the distribution files. Use a Windows editor (e.g. Notepad) to prepare your own. Each search-word must be on a separate line of your file. No comment lines can be included, though blank lines may be inserted for readability.
If you want to require a context for a given word, put context:= as in this example:
book context:=hotel
(which seeks book and only shows results if hotel comes in the context horizons).
Then, instead of typing in each word or phrase in the Search Word dialogue box, just browse for the file. Then press Load to read the entries (or Clear if you change your mind).

Lemmas and file-based concordancing
Note that where Concord has been called up from WordList, and the highlighted word in the word list is the head entry with lemmas, a temporary file will be created, listing the whole set of lemmas, and Concord will use this file-based search-word procedure to compute the concordance. The temporary file will be stored in your Documents\wsmith6 folder unless you're running on a network, in which case it'll be in Windows' temporary folder, e.g. \windows\temp. It's up to you to delete the temporary file.

Automated file-based concordances
If you want Concord to process a whole lot of different search-words, saving each result as it goes along so you can get a lot of work done with WordSmith unattended, choose SW Batch under Concordance Settings.

7.4.3 search-word and other settings
Search Word or Phrase and/or Tag
Type the word or phrase Concord will search for when making the concordance, or (below) the name of a file of search words. You may also choose from a history list of your previous search words. For details of syntax, see Search Word Syntax or the set of examples shown in this screenshot:

If you want to do many concordances in a file-based search, first prepare a small text file containing the search words, e.g. containing
this
that
the other
==Major*==
Press the file button to locate your text file, then press the Load button. This will then change its name to something like Clear 4, where 4 means, as in the example above, that there are 4 different search-words to be concordanced. See "Batch" below for details on saving each one under a separate filename, otherwise all the searches will be combined into the same concordance.

Advanced searches

lemma list search
This option requires you to have chosen and loaded a lemma file. If the lemma file you've loaded specifies for example speak -> speaks, spoke, spoken then if your search-word is speak, the concordance will contain examples of all four forms.

Context word(s) and search horizons
You may wish to find a word or phrase depending on the context. In that case you can specify context word(s) which you want, or which you do not want (and if found will mean that entry is not used). For example, if the search word is book* and the context word is hotel, you'll get book, books, booked, booking, bookable, but only if hotel is found within your Context Search Horizons. Or if the search word is book* and the exclude if box has hotel, you'll get book, books, booked, booking, bookable, as long as hotel is not found within your context search horizons. Or if the search word is *ish and the exclusion specifies fish, you'll get yellowish, greenish, etc. but not fish.
You may type tag mark-up in here too, e.g. a search for book with a context word <ADJ>* in position up to L3 will find book with a preceding adjective if your text has that sort of mark-up and if you've defined a tag file including <ADJ>.
In the screenshot above you see that "stop at sentence break" has been selected, meaning that a collocation search will only go left or right of the search-word up to a sentence-end. This is further explained here.
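The context-word and "exclude if" logic described above can be sketched as follows. This is a minimal Python illustration, not Concord's implementation: the function find_with_context and its parameters are hypothetical, the horizon sizes are passed in explicitly, and tags and sentence breaks are not handled.

```python
def find_with_context(tokens, is_hit, context=None, exclude=None, left=5, right=5):
    """Return indexes of search-word hits, filtered by words within the horizons."""
    hits = []
    for i, tok in enumerate(tokens):
        if not is_hit(tok):
            continue
        # Words inside the horizons, excluding the search word itself.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        window = [w.lower() for w in window]
        if context is not None and context.lower() not in window:
            continue   # required context word is missing
        if exclude is not None and exclude.lower() in window:
            continue   # unwanted context word is present
        hits.append(i)
    return hits

tokens = "we booked a hotel in Rome but the book on the shelf was about fishing".split()
starts_with_book = lambda t: t.lower().startswith("book")   # stands in for book*

print(find_with_context(tokens, starts_with_book, context="hotel", left=3, right=3))  # [1]
print(find_with_context(tokens, starts_with_book, exclude="hotel", left=3, right=3))  # [8]
```

With horizons of 3 words each way, booked (index 1) has hotel nearby and book (index 8) does not, so the two calls select opposite hits.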

Batch
Suppose you're concordancing book* in 20 text files: you might want one concordance based on all 20 files (the default), or instead 20 separate concordances in a zipped batch which can be viewed separately (Text Batch). If you have multiple search-words in a file-based search as explained above, you may want each result saved separately (SW Batch).

Other settings affecting a concordance are available too: see WordSmith Controller Concordance Settings; Typing characters, Accented characters; Choosing Language, Context Word(s) & Context Search Horizons

7.5 advice
You have a listing showing all the concordance lines in a window. You can scroll up and down and left or right with the mouse or with the cursor keys.

Sort the lines
If you have a lot of lines you should certainly sort them. A concordance is read vertically, not horizontally. You are looking for repeated patterns, such as the presence of lots of the same sorts of words to the right or left of your search-word. Click the bar at the top to start a sort.

The Columns
These show the details for each entry: the entry number, the concordance line, set, tag, word-position (e.g. 1st word in the text is 1), paragraph and sentence position, source text file name, and how far into the file it comes (as a percentage). See below for an explanation of the purple blobs. The statistics are computed automatically based on the language settings.

Set
This is where you can classify the entries yourself, using any letter, into user-defined categories. Supposing you want to sort out verb uses from noun uses, you can press V or N. To type more (e.g. "Noun"), double-click the entry in the set column and type what you want. If you have more than one search-word, you will find the Set column filled with the search-word for each entry. To clear the current entry, you can type the number 0. To clear the whole Set column, choose Edit | Clear Set column.

Tag
This column shows the tag context.

More context?
Stretching the display to see more
You can pull the concordance display to widen its column. Just place the mouse cursor on the bar between one column and another; when the cursor changes shape you can pull the whole column.

Stretch one line to see more context
The same applies to each individual row: place the mouse cursor between one row and another in the grey numbered area, and drag. Or press (F8) to "grow" all the rows, or (Ctrl+F8) to shrink them. Or press the numeric key-pad 8 to grow the current line as shown below. (Use numeric key-pad 2 to shrink it.)

Viewing the original text-file (if it is still on the disk where it was when the concordance was originally created)
Double-click the concordance column, and the source text window will load the file and highlight the search word. Or double-click the filename column, and it will open in Notepad for editing.

Other things you may wonder about
Weird purple marks
In the screenshot you will see purple marks where any column is not wide enough to show all the data. The reason is that numbers are often not fully visible and you might otherwise get the wrong impression. For example in the concordance below, the Word # column shows 4,569 but the true number might be 14,569. Pull the column wider and the purple lines disappear.

Status bar
The status bar panels show
· the number of entries (1,000 in the "stretch one line" screenshot above);
· whether we're in "Set" or "Edit" mode;
· the current concordance line from its start.

See also:
Re-sorting your concordance lines
Follow-up searches
User-defined categories
Altering the View
Blanking out the search-word

Padding the search-word with spaces (use the search-word padding menu item to put a space on each side of the search-word)
Collocation (words in the neighbourhood of the search-word)
Plot (plots where the search-word came in the texts)
Clusters (groups of words in your concordance)
Text segments in Concord
Editing the concordance
Time-lines
Zapping entries
Saving and printing
Window Management

7.6 blanking
In a concordance, to blank out the search-words with asterisks, just press the spacebar (or choose View | Blanked out). Press it again to restore them.

The point of it...
A blanked-out concordance is useful when you want to create an exercise. This one has give and put mingled:
... could not ********** me the time ...
... Rosemary, ********** me another ...
... would not ********** much for that ...
... could not ********** up with him ...
... so you'll ********** him a present ...
... will soon ********** up smoking ...
... he should ********** it over here ...
Concord will give equal space to the blanks so that the size of the blank doesn't give the game away.

See also: Other main Controller settings for Concord

7.7 Category or Set
7.7.1 set column categories
The point of it...
You may want to classify entries in your own way, e.g. separating adjectival uses from nominal ones, or sorting according to different meanings.

Here the user has used P where may has to do with probability and M if it's a month. In addition, some items have been labelled in more detail. You may wish also to modify your original texts to include this annotation work you've done.

How to do it
If you simply press a letter or number key while the edit v. set v. type-in mode is on Set, you will get the concordance line marked with that letter or number in the Set column.
You can sort the concordance lines using these categories, simply by clicking on the header Set, which will have a small triangle showing it's sorted.
To enter the same value for various rows, first select the rows or mark them, then choose Set column | Edit, then type in a suitable value.

Colours
If you want to type something longer and optionally in a specific colour, double-click the set column and you'll get a chance to type more. Here the word permission has been typed and the colour has been dragged onto the box.

Clearing the Set column

To correct a mistake, press the zero key; that will remove any text and colour from the selected entry. (If you press the spacebar you will get blanking.)

See also: Colour categories, modify your source texts, edit v. type-in mode.

7.7.2 colour categories
The point of it...
The idea is to follow up a large concordance by breaking it down into specific sub-sections, so one can see how many of each sub-type are found in the whole list.

Example
The screen-shot below came from a concordance of beautiful in Charles Dickens:
There are 774 lines. Looking through them, it became apparent that Dickens was fond of the collocations beautiful creature and beautiful face, but how many are of beautiful creature or similar (such as so.. beautiful a creature in line 1) and what proportion of the lines is that?

How to do it

Choose Compute | Colour Categories in the menu, which opens up the colour categories box:

Here we have completed the spaces so as to get cases of beautiful ... with creature up to 4 words away to the right, and chosen to colour yellow any which meet this condition. On pressing OK we find out there are 16, representing just over 2% of the lines, and looking at the concordance the first line is now marked:

Where are the other 15? To find them, simply sort on the Set column.
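The condition used in this example (creature up to 4 words to the right of beautiful) and the proportion reported (16 of 774 lines) can be checked with a short Python sketch. The function name and all the sample lines except the first are hypothetical; the first echoes the manual's own example, and the code is only an illustration of the test being applied, not of the colour categories dialogue itself.

```python
def meets_condition(line_tokens, node, collocate, max_right):
    """True if collocate occurs within max_right words to the right of node."""
    toks = [t.lower() for t in line_tokens]
    return any(
        tok == node and collocate in toks[i + 1:i + 1 + max_right]
        for i, tok in enumerate(toks)
    )

lines = [
    "so beautiful a creature as she".split(),
    "a beautiful face and a kind heart".split(),
    "the beautiful creature smiled".split(),
    "creature comforts are beautiful things".split(),
]
marked = [meets_condition(l, "beautiful", "creature", 4) for l in lines]
print(marked)               # [True, False, True, False]

# The proportion quoted in the text: 16 marked lines out of 774.
print(f"{16 / 774:.1%}")    # 2.1%
```

The last sample line shows why direction matters: creature is present but to the left of beautiful, so the line is not marked.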

This function applies to word lists and other data too, and is explained in more detail in the main colour categories section. The set column itself can contain characters or words as well as colours, as explained in the set column section.

7.8 clusters
The point of it...
These word clusters help you to see patterns of repeated phraseology in your concordance, especially if you have a concordance with several thousand lines. Naturally, they will usually contain the search-word itself, since they are based on concordance lines.
Another feature in Concord which helps you see patterns is Patterns.

How it does it...
Clusters are computed automatically if this is not disabled in the main Controller settings for Concord (Concord Settings) where you will see something like this:
where your usual default settings are controlled. "Minimal processing", if checked, means do not compute collocates, clusters, patterns etc. when computing a concordance. (They can always be computed later if the source text files are still present.)

Clusters are sought within these limits: default: 5 words left and right of the search word, but up to 25 left and 25 right allowed. The default is for clusters to be three words in length and you can choose how many of each must be found for the results to be worth displaying (say 3 as a minimum frequency).

Clusters are calculated using the existing concordance lines. That is, any line which has not been deleted or zapped is used for computing clusters.

As with WordList index clusters, the idea of "stop at sentence breaks" (there are other alternatives) is that a cluster which spans across two sentences is not likely to make sense.

Re-computing clusters
The default clusters computed may not suit (and you may want to recompute after deleting some lines), so you can also choose Compute | Clusters in the Concord menu, so as to choose how many words a cluster should have (cluster size 2 to 4 words recommended), and alter the other settings.

When you press OK, clusters will be computed. In this case we have asked for 3- to 5-word clusters and get results like this:

The clusters have been sorted on the Length column so as to bring the 5-word clusters to the top. At the right there is a set of "Related" clusters, and for most of these lines it is impossible to see all of their entries. To solve this problem, double-click any line in the Related column and another window opens. Here is the window showing what clusters are related to the 3-word cluster the cause of, which is the most frequent cluster in this set:

"Related" clusters are those which overlap to some extent with others, so that the cause of overlaps with devoted to the cause of, etc. The procedure seeks out cases where the whole of a cluster is found within another cluster.

See also: general information on clusters, WordList Clusters, Word Clouds.
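The cluster idea described in this section (count repeated word sequences in the concordance lines, keep those reaching a minimum frequency, then look for clusters wholly contained in longer ones) can be sketched in Python as below. This is an illustration under simplifying assumptions, not Concord's procedure: the function names are hypothetical, the sample lines are invented, and the horizons and "stop at sentence breaks" settings are ignored.

```python
from collections import Counter

def clusters(lines, sizes=(3,), min_freq=3):
    """Count word sequences of the given sizes; keep those meeting min_freq."""
    counts = Counter()
    for tokens in lines:
        toks = [t.lower() for t in tokens]
        for n in sizes:
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return {c: f for c, f in counts.items() if f >= min_freq}

def related(cluster, others):
    """Longer clusters which contain the whole of the given cluster."""
    n = len(cluster)
    return [
        o for o in others
        if len(o) > n and any(o[i:i + n] == cluster for i in range(len(o) - n + 1))
    ]

lines = [
    "devoted to the cause of freedom".split(),
    "devoted to the cause of peace".split(),
    "for the cause of justice".split(),
]
found = clusters(lines, sizes=(3, 5), min_freq=2)
print(found[("the", "cause", "of")])            # 3
print(related(("the", "cause", "of"), found))   # [('devoted', 'to', 'the', 'cause', 'of')]
```

Here the cause of occurs in all three lines, and the only longer cluster reaching the minimum frequency that contains it is devoted to the cause of.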

7.9 Collocation
7.9.1 what is collocation?
What's a "collocate"?
Collocates are the words which occur in the neighbourhood of your search word. Collocates of letter might include post, stamp, envelope, etc. However, very common words like the will also collocate with letter.

and "colligation"?
Linkages between neighbouring words which involve grammatical items are often referred to as colligation. That rely is typically followed by a preposition in English is a colligational fact.

The point of it...
The point of all this is to work out characteristic lexical patterns by finding out which "friends" words typically hang out with. It can be hard to see overall trends in your concordance lines, especially if there are lots of them. By examining collocations in this way you can see common lexical and grammatical patterns of co-occurrence.

Options
You may compute a concordance with or without collocates (minimal processing): without is slightly quicker and will take up less room on your hard disk. The default is to compute with collocates.
The number of collocates stored will depend on the collocation horizons.
You can re-compute collocates after editing your concordance.
If you want to filter your collocate list, use a match list or stop list.
Re-sort a collocate list in a variety of ways.
You can see the strength of relationship between the word and the search-word which the concordance was based on.
Collocates can be viewed after the concordance has been computed.

Technical Note
The literature on collocation has never distinguished very satisfactorily between collocates which we think of as "associated" with a word (letter - stamp) on the one hand, and on the other, the words which do actually co-occur with the word (letter - my, this, a, etc.). We could call the first type "coherence collocates" and the second "neighbourhood collocates" or "horizon collocates".
It has been suggested that to detect coherence collocates is very tricky, as once we start looking beyond a horizon of about 4 or 5 words on either side, we get so many words that there is more noise than signal in the system. KeyWords allows you to study Associates, which are a pointer to "coherence collocates". Concord will supply "neighbourhood collocates". WordList and Concord allow you also to study relationships between words.

See also: collocation display, collocation settings, collocation relationship, relationships between words display.

7.9.2 collocate horizons
The collocate horizons represent the number of words to the left and right of your search word within which Concord will look for collocates, and the distance used by KeyWords in searching out plot-links. The default is 5L,5R (5 to left and 5 to right) but you can go up to 25 on either side. You can also choose whether to set collocation boundaries such as sentence or paragraph breaks.
To set collocation horizons and other Concord settings, see Main Controller Concord Settings, or in the main WordSmith Controller, choose Concord Settings.

See also: Collocate Settings

7.9.3 collocation relationship
The point of it...
The idea is to find out how strongly each collocate relates to the search-word near which it was found. MI (or other relevant statistic) is not computed by default for a collocate list.

How to compute it
In the Concord menu, choose Compute | Relationships:

Steps
1. Suppose you have made a concordance on love using all the files in Documents\wsmith6\text\shakespeare. You see collocates such as hate, the, Juliet, Nurse, Romeo, etc. All these show a "Relation" score of "??" because they haven't yet been computed.
2. If you haven't done so yet, use WordList to make a word list of the same text files (or if you prefer, use some other reference corpus). Make sure the reference corpus file is what you prefer.
3. Now choose the menu item and Concord will use the reference corpus filename. It will look up each of your collocates in the word list and compute MI using the information in the reference corpus word list.

You can choose a different statistic in the main Controller Concord settings.

Note: the procedure goes through your collocates and tries to find each in the word-list. If absent, you get a blank result. If one of your search-terms has a space in it such as Friar Lawrence, an ordinary single-word word list won't know its frequency and you will be asked to supply it. If you don't know, you should compute a concordance on that search-phrase over the same corpus first.
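To make the steps above more concrete, here is a minimal Python sketch of a mutual information score computed from a word list. It uses the common textbook form of pointwise MI, the base-2 logarithm of observed over expected co-occurrence; this is an assumption, since the manual does not give WordSmith's exact formula, and the statistic the program computes may adjust for the size of the horizons or differ in other ways. All names and all the frequencies in the example are hypothetical.

```python
import math

def mutual_information(joint, f_node, f_collocate, corpus_tokens):
    """Pointwise MI in bits: log2(observed / expected co-occurrence)."""
    expected = f_node * f_collocate / corpus_tokens
    return math.log2(joint / expected)

def relation_scores(collocates, word_list, node, corpus_tokens):
    """Look each collocate up in the word list; None stands for a blank result."""
    scores = {}
    for word, joint in collocates.items():
        if word not in word_list or node not in word_list:
            scores[word] = None   # absent from the word list
        else:
            scores[word] = mutual_information(
                joint, word_list[node], word_list[word], corpus_tokens
            )
    return scores

# Hypothetical frequencies, for illustration only.
word_list = {"love": 2000, "hate": 1000, "the": 60000}
collocates = {"hate": 50, "the": 150, "juliet": 12}
scores = relation_scores(collocates, word_list, "love", 1_000_000)
print(round(scores["hate"], 2))   # 4.64
print(scores["juliet"])           # None
```

A frequent word such as the scores low here (about 0.32) even though it co-occurs often, which is the behaviour one would expect of a relationship statistic as opposed to a raw frequency.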

Full lemma processing, case sensitive
These are only relevant if your word list has any lemmatised entries, or it is a case-sensitive word list and you wish processing to respect case-sensitivity.

Relation statistic
Choose which type of relation you wish to compute. The default is Specific Mutual Information but in the screenshot Z score has been chosen.

Column for relation
The default is "Total". If you choose Total you're computing the relationship across the current collocation horizons set. If you prefer to examine the relationship at only one position instead, you may:

See also: Collocation, Collocate display, Mutual Information

7.9.4 collocates display
Display
The collocation display initially shows the collocates in frequency order.
Beside each word and the search-word which the concordance was based on, you'll see the strength of relationship between the two (or 0.000 if it hasn't yet been computed). Then, the total number of times it co-occurred with the search word in your concordance, and a total for Left and Right of the search-word. Then a detailed break-down, showing how many times it cropped up 5 words to the left, 4 words to the left, and so on up to 5 words to the right. The centre position (where the search word came) is shown with an asterisk. The number of words to left and right depends on the collocation horizons.

The numbers are:
the total number of times the word was found in the neighbourhood of the search word
the total number of times it came to the left of the search-word
the total number of times it came to the right of the search-word
a set of individual frequencies to the left of the search word (5L, i.e. 5 words to the left, 4L .. 1L)
a Centre column, representing the search-word
a set of individual frequencies to the right of the search word (1R, 2R, etc.)

The number of columns will depend on the collocation word horizons. With 5,5 you'll get five columns to the left and 5 to the right of the search word. So you can see exactly how many times each word was found in the general neighbourhood of the search word and how many times it was found exactly 1 word to the left or 4 words to the right, for example.

The most frequent will be signalled in most frequent collocate colour (default = red). In the screenshot below, differences comes 44 times in total but 39 of these are in position L1.

The screenshot above shows collocation results for a concordance of BETWEEN/AMONG sorted by the Relation column, where items like differentiate, difference etc. are found to be most strongly related to between. Further down the listing, some links concerning among (growing, refugees) are to be seen.
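The layout of the collocate display (a Total, Left and Right figure plus one frequency per position) can be reproduced with a small Python sketch. It is an illustration only: the function names are hypothetical, the concordance lines are invented, and each line is assumed to be supplied as a list of words together with the index of the search word.

```python
from collections import defaultdict

def collocate_table(lines, left=5, right=5):
    """lines: (tokens, node_index) pairs. Returns {word: {position: frequency}}."""
    table = defaultdict(lambda: defaultdict(int))
    for tokens, node in lines:
        for offset in range(-left, right + 1):
            j = node + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue   # skip the centre position and anything off the line
            position = f"L{-offset}" if offset < 0 else f"R{offset}"
            table[tokens[j].lower()][position] += 1
    return table

def summarise(row):
    """Total, Left and Right figures for one collocate."""
    total_left = sum(f for p, f in row.items() if p.startswith("L"))
    total_right = sum(f for p, f in row.items() if p.startswith("R"))
    return {"Total": total_left + total_right, "Left": total_left, "Right": total_right}

lines = [
    ("the differences between the two".split(), 2),
    ("no differences between them".split(), 2),
    ("between the differences".split(), 0),
]
table = collocate_table(lines)
print(dict(table["differences"]))         # {'L1': 2, 'R2': 1}
print(summarise(table["differences"]))    # {'Total': 3, 'Left': 2, 'Right': 1}
```

The default horizons of 5 left and 5 right follow the default given in this manual; changing them changes how many position columns can be filled.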

The frequency display can be re-sorted and you can recalculate the collocates if you zap entries from the concordance or change the horizons.

You can also highlight any given collocate in your concordance display.

See also: Collocation, Collocation Relationship, Collocates and Lemmas, Word Clouds, Mutual Information

7.9.5 collocates and lemmas

In the following case a lemma list was used and lemma search specified, with a concordance on the word abandon, with these results showing which form of the lemma was used in the Set column.

In the collocate window below, the red line in row 1 indicates that the 140 cases of ABANDON include other forms such as 78 cases of ABANDONED and 19 of ABANDONING (greyed out below).

The red mark by BE (row 5) shows that this row gives collocation numbers covering all forms of BE such as WAS, WERE etc. Similarly, HAVE and A are lemmatised in this screenshot.

Thus, for your search-word and its variants you can see detailed frequencies, but its collocates, though they do get lemmatised, do not show you the variant forms or any specific frequencies.

7.9.6 collocate follow

The point of it
The idea (from Paul Raper) is to be able to follow up a collocate by requesting a new concordance based on it, in the same text files as selected for the collocate. This aids exploration of related words.

How to do it
Here is an example, where there is a collocate list relating to BEERS. Select a word of interest, such as KEG, and select Follow collocate in the Compute menu.

WordSmith starts up a new Concord window with a search on KEG.

The search is carried out on the most recently selected text files (selected using the file-choose window or by reading in a saved concordance).

7.9.7 collocate highlighting in concordance

The point of it...
The idea is to be able to see a selected collocate highlighted in the concordance.

In this example, the texts were Shakespeare plays and the search word was love. One of the collocates is know, occurring a total of 50 times, with the most frequent at position 4 words to the left of love. Double-clicking 14 in the L4 column to the right of know, we see this in the concordance:

We have brought to the top of the concordance those lines which contain know in position L4.

How to do it
In a collocates window or a patterns window, simply double-click the item you wish to highlight. Or select it and choose View | Highlight selected collocate.

In the collocates window, what you get depends on where you click:
the Word column or the Total column: all instances of the word
Total Left: those to the left (33 in the case of know above)
Total Right: those to the right (17)
otherwise: those in that column only

To get rid of the highlighting, re-sort in a different way or choose the menu item View | Refresh.

7.9.8 collocate settings

To set collocation horizons and other Concord settings, in the main WordSmith Controller menu at the top, choose Concord Settings.

Collocates are computed case-insensitively (so my in the concordance line will be treated like My).
If you don't want certain collocates such as THE to be included, use a stop-list.
You can lemmatise (join related forms like SPEAK -> SPEAKS, SPOKE, SPOKEN) using a lemma list file.

Minimum Specifications
The lowest value you can set for minimum length is 1, and for minimum frequency 1 (the default frequency is 10).
You can specify here how frequently a collocate must have appeared in the neighbourhood of the Search Word. Words which only come once or twice are less likely to be informative. So specifying 5 will only show a collocate which comes 5 or more times in the neighbouring context.
Similarly, you can specify how long a collocate must be for it to be stored in memory, e.g. 3 letters or more would be 3.

Horizons
Here you specify how many words to left and right of the Search Word are to be included in the collocation search: the size of the "neighbourhood" referred to above. The maximum is 25 left and 25 right. Results will later show you these in separate columns so you can examine exactly how many times a given collocate cropped up say 3 words to the left of your Search Word.
The most frequent will be signalled in the most frequent collocate colour (default = red).

Breaks
These are visible in the Controller Concord Settings, which you will see in the bottom right corner of the screen.

When the collocates are computed, if the setting is to stop at sentence breaks, collocates will be counted within the above horizons but taking sentence breaks into account.

For example, if a concordance line contains

source, per pointing integration times, respectively. However, when we compared these two maps

and the search-word is however, only when we compared these two will be used for collocates because there is a sentence break to the left of the search word. If the setting is "stop at punctuation", then nothing will come into the collocate list for that line (because there is a more major break than punctuation to the left of it, and no word to the right of the search-word before a punctuation symbol).

stop at end of text: end of text is by default assumed to be the end of the text file.
stop at heading or section: this works by recognising ends of heading or section which you can specify in the text format box (language settings).
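The effect of the Breaks setting on the worked example above can be sketched in Python. This is a minimal illustration, not WordSmith's implementation: the function collocates_for_line, its stop_at parameter and the simple regular-expression tokeniser are assumptions made for the example, and "stop at end of text" and "stop at heading or section" are not modelled.

```python
import re

SENTENCE_BREAKS = {".", "!", "?"}
PUNCTUATION = SENTENCE_BREAKS | {",", ";", ":"}

def tokenise(text):
    """Split into words and single punctuation marks (a simplifying assumption)."""
    return re.findall(r"[A-Za-z']+|[.!?,;:]", text)

def collocates_for_line(text, search_word, left=5, right=5, stop_at="sentence"):
    """Return the collocates of search_word within the horizons.

    stop_at is "none", "sentence" or "punctuation" (hypothetical names).
    A sentence break also counts as punctuation, so "punctuation" is the
    stricter setting. Words are lower-cased, as collocates are computed
    case-insensitively.
    """
    stops = {"none": set(),
             "sentence": SENTENCE_BREAKS,
             "punctuation": PUNCTUATION}[stop_at]
    tokens = tokenise(text)
    lowered = [t.lower() for t in tokens]
    node = lowered.index(search_word.lower())  # first occurrence only

    def walk(step, horizon):
        found = []
        i = node + step
        while 0 <= i < len(tokens) and len(found) < horizon:
            if tokens[i] in stops:
                break                      # a break ends the neighbourhood
            if tokens[i] not in PUNCTUATION:
                found.append(lowered[i])   # punctuation is not a collocate
            i += step
        return found

    return walk(-1, left)[::-1] + walk(+1, right)

line = ("source, per pointing integration times, respectively. "
        "However, when we compared these two maps")
print(collocates_for_line(line, "however", stop_at="sentence"))
# ['when', 'we', 'compared', 'these', 'two']
print(collocates_for_line(line, "however", stop_at="punctuation"))
# []
```

With stop_at="sentence" the full stop immediately to the left blocks the whole left horizon, while the comma to the right is skipped; with stop_at="punctuation" that comma blocks the right horizon as well, so the line contributes nothing.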

7.9.9 re-sorting: collocates

The point of it...
is to home in, for example, on the ones in L1 or R1 position. To find sub-patterns of collocation, so as to more fully understand the company your search-word keeps.

Here the collocates of COULD in some Jane Austen texts show how negatives crop up a lot in R1 position.

How to do it...
just press the header

The frequency-ordered collocation display can be re-sorted to reveal the frequencies sorted by their total frequencies overall (the default), by the left or right frequency total, or by any individual frequency position. Just press the header of a column to sort it. Press again to toggle the sort between ascending and descending.

You can also get the concordance lines sorted so as to highlight specific collocates, as in the case of the 70 cases of NEVER in R1 position in the screenshot.

Word Clouds
You can also get a word cloud of your sorted column. In the screenshot below, a concordance on cash generated these R1 collocates (with most function words eliminated using a stoplist), and these data fed straight into a word cloud.

In the word cloud, the mouse hovered over the word accounting so the details of that word are shown to the right.

See also: Collocation, Collocation Display, Collocation Horizons, Patterns, Word Clouds

7.10 dispersion plot

The point of it...
This shows where the search word occurs in the file which the current entry belongs to. That way you can see where mention is made most of your search word in each file. Another case where the aim is to promote the noticing of linguistic patterning.

What you see
The plot shows:
File: source text file-name
Words: number of words in the source text
Hits: number of occurrences of the search-word
per 1,000: how many occurrences per 1,000 words
Dispersion: the plot dispersion value
Plot: a plot showing where they cropped up, where the left edge of the plot represents the beginning of the text file ("Once upon a time" for example) and the right edge is at the end ("happily ever after", though not in the case of Romeo and Juliet).

Here we see a plot of "O" and another of "AH" from the play Romeo and Juliet. They are on separate lines because there were 2 search-words. There are more "O" exclamations than "AH"s.

As the status bar says, you can get the word numbers for the plot by double-clicking the plot area.

Using View | Ruler, you can switch on a "ruler" splitting the display into segments. The plot below is of one search-word (beautiful) in lots of texts.
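As a rough illustration of the Words, Hits, per 1,000 and Plot columns, here is a minimal Python sketch. It is not WordSmith's code: the function plot_row, the segments parameter and the text-only "plot" are assumptions made for the example, and the Dispersion value is left out because its formula is not given in this section.

```python
def plot_row(tokens, search_word, segments=8):
    """Compute the basic dispersion-plot figures for one text.

    tokens: the running words of the text, in order.
    segments: how many character cells the text plot has (hypothetical;
    it stands in for the available screen width).
    """
    target = search_word.lower()
    hits = [i for i, tok in enumerate(tokens) if tok.lower() == target]
    words = len(tokens)
    per_1000 = 1000 * len(hits) / words if words else 0.0

    # The left edge is the start of the text and the right edge its end.
    # Hits falling in the same cell share a single mark.
    cells = [" "] * segments
    for i in hits:
        cells[min(segments - 1, i * segments // words)] = "|"

    return {"Words": words, "Hits": len(hits),
            "per 1,000": round(per_1000, 2), "Plot": "".join(cells)}

# An invented eight-word "text" with the search word at both ends.
sample = "O Romeo Romeo wherefore art thou Romeo O".split()
print(plot_row(sample, "o", segments=4))
```

Because several hits can land in one cell, a narrow plot may show fewer marks than there are hits.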

The status-bar gives details of the highlighted text.

Multiple Search-words or Texts
If there are 2 or more search-words or texts, you will see something like this:

[screenshot]

where the File column supplies the file-name and the search-word in that order. If you want it with the search-word first, go to the Concord settings in the Controller, What you see, click the relevant option and re-sort the File list.

Double-click to see the source text
Just double-click in the File column.

Uniform view
There are two ways of viewing the plot: the default, where all plotting rectangles are the same length, or Uniform Plot (where the plot rectangles reflect the original file size -- the biggest file is longest). Change this in the View menu at the top. Here is the same one with Uniform plot. The blue edge at the right reflects the file size in each case.

If you don't see as many marks as the number of hits, that'll be because the hits came too close together for the amount of screen space in proportion to your screen resolution. You can stretch the plot by dragging the top right edge of it. You can export the plot using Save As and can get your spreadsheet to make graphs etc.

Each plot window is dependent on the concordance from which it was derived. If you close the original concordance down, it will disappear. You can Print the plot. There's no Save option for the plot alone but you can of course save the concordance itself. You can Copy to the clipboard (Ctrl+C) and then put it into a word processor as a graphic, using Paste Special.

Advanced plots
When you first compute a concordance, the plot will assume you want a dispersion plot of each text file on a separate line and each different search-word on a separate line as seen above. If you have more than one text file or search-word, when you choose the Compute | Plot menu item afterwards, you will get a chance to merge your plots and omit some text files or search-words.

A first view of the plot settings may resemble this. All the files have by default been sorted into separate sets and so have all the search-words. The red colour indicates files or search-words which have been included in each list of sets at the right.

Now if you Clear them, you can either select and drag or select and press the central button to get your preferred selections. (The button showing a green funnel will put all into one set, the other one will use one set for each, by the way.)

Here is a set of preferences with lots of files and two search-words:

[screenshot]

giving results like this:

[screenshot]

See also: plot and ruler colours, plot dispersion value.

7.11 concordancing on tags

The point of it...
Suppose you're interested in identifying structures of a certain type (as opposed to a given word or phrase), for example sequences of Noun+Noun+Noun. You can type in the tags you want to concordance on (with or without any words).

How to do it...
In Concord's search-word box, type in the tags you are interested in. Or define your tags in a tag-file.

Examples
<w NN1>table finds table as a singular noun (as opposed to as a verb)
<w NN1>* <w NN1>* will find any sequence of two singular common nouns in the BNC Sampler.

Note that <w NN1>table finds table if your text is tagged with < and > symbols, or if you have specified [ and ] as tag symbols, it will find [w NN1]table.

There are some more examples under Search word or phrase.

It doesn't matter whether you are using a tag file or not, since WordSmith will identify your tags automatically. (But not by magic: of course you do need to use previously tagged text to use this function.)

In example 2, the asterisks are because in the BNC, the tags come immediately before the word they refer to: if you forgot the asterisk, Concord would assume you wanted a tag with a separator on either side.

Are you concordancing on tags?
If you are asked this and your search-word or phrase includes tags, answer "Yes" to this question. If not, your search word will get double quotes (" ") inserted around each < or > symbol in it, as explained under Search Word Syntax.

Case Sensitivity
Tags are only case sensitive if your search-word or phrase is. Search words aren't (by default). So in example 1, you will retrieve table and Table and TABLE if used as nouns (but nothing at all if no tags are in your source texts).

Hide Tags?
After you have generated a concordance you may wish to hide the mark-up. See the View menu for this.

See also: Overview of Tags, Handling Tags, Showing Nearest Tags in Concord, Search word or phrase, Types of Tag, Viewing the Tags, Using Tags as Text Selectors

7.11.1 nearest tag

Concord allows you to see the nearest tag, if you have specified a tag file, which teaches WordSmith Tools what your preferred tags are. Then, with a concordance on screen, you'll see the tag in one of the columns of the concordance window.

The point of it...
The advantage is that you can see how your concordance search-word relates to marked-up text. For example, if you've tagged all the speech by Robert as [Rob] and Mary as [Mary], you can quickly see in any concordance involving conversation between Mary, Robert and others, which ones came from each of them.

Alternatively, you might mark up the sections of your text with tags; Nearest Tag will then show the section tag beside each line like this:

1 ... could not give me the time ...
2 ... Rosemary, give me another ...
3 ... wanted to give her the help ...
4 ... would not give much for that ...

To mark up text like this, make up a tag file with your sections and label them as sections, as in these examples:

/description "section"
/description "section"
/description "section"

or, if you want to identify the speech of all characters in a play, and have a list of the characters, and they are marked up appropriately in the text file, something like this:

/description "section"
/description "section"
/description "section"

In cases using "section", Nearest Tag will find the section, however remote in the text file it may be. Without the keyword "section", Nearest Tag shows only the current context within the span of text saved with each concordance line.

You can sort on the nearest tags. In the shot below, a concordance of such has been computed using BNC text. Some of the cases of such are tagged <PRP> (such as) and others are tagged differently. The Tag column shows the nearest tag, and the whole list has been sorted using that column.

If you can't see any tags using this procedure, it is probably because the Tags to Ignore have the same format. For example, if Tags to Ignore has <*>, any tags such as <quote> will be cut out of the concordance unless you specify them in a tag file. If so, specify the tag file and run the concordance again.

You can also display tags in colour, or even hide the tags -- yet still colour the tagged word. Here is a concordance of this in the BNC text with the tags in colour:

[screenshot]

and here is a view showing the same data, with View | Hide Tags selected.

[screenshot]

The tags themselves are no longer visible, and only 6 types of tag have been chosen to be viewed in colour.

See also: Guide to handling the BNC, Handling Tags, Making a Tag File, Overview of Tags, Tagged Texts, Types of Tag, Using Tags as Text Selectors, Viewing the Tags

7.12 context word

You may restrict a concordance search by specifying a context word which either must or may not be present within a certain number of words of your search word.

For example, you might have book as your search word and hotel* as the context word. This will only find book if hotel or hotels is nearby.

Or you might have book as your search word and paper* as an exclusion criterion. This will only find book if paper or papers is not within your Context Search Horizons.

Context Search Horizons
The context horizons determine how far Concord must look to left and right of the search word when checking whether the search criteria have been met. The default is 5,5 (5 to left and 5 to right of the search word) but this can be set to up to 25 on either side. 0,2 would look only to the right within two words of the search word.

In this example the search-word is beautiful and the context word is lady, to be sought either left or right of beautiful.

Syntax is like that of the search word or phrase:
* means disregard the end of the word and can be placed at either end of your context word.
== means case sensitive
/ separates alternatives. You can specify up to 15 alternatives within an 80-character overall limit.
If you want to use *, ?, ==, ~, :\ or / as a character in your search word, put it in double quotes, e.g. "*"

In line 14, the search-word and the context-word are in separate sentences. To avoid this, specify a suitable stop as shown here:

[screenshot]

and with the same settings you will get results like these:

[screenshot]

If you have specified a context word, you can re-sort on it. Also, the context words will be in their own special colour.

Note: the search only takes place within the current concordance line with the number of characters defined as characters to save. That is, if for example you choose search horizons 25L and 25R, but only 1000 characters are saved in each line, there might not be 25 words on either side of the search-word to examine when seeking the context word or phrase if there was extensive mark-up as well.

7.13 editing concordances

The point of it...
You may well find you have got some entries which weren't what you expected. Suppose you have done a search for SHRIMP*/PRAWN* -- you may find a mention of Shrimpton in the listing. It's easy to clean up the listing by simply pressing Del on each unwanted line. (Do a sort on the search word first so as to get all the Shrimptons next to each other.) The line will turn a light grey colour. Pressing Ins will restore it, if you make a mistake. To delete or restore ALL the lines from the current line to the bottom, press the grey - key or the grey + key by the numeric keypad. When you have finished marking unwanted lines, you can press Ctrl+Z to zap the deleted lines.

If you're a teacher you may want to blank out the search words: to do so, press the spacebar. Pressing the spacebar again will restore it, so don't worry!

7.13.1 remove duplicates

The problem
Sometimes one finds that text files contain duplicate sections, either because the corpus has become corrupted through being copied numerous times onto different file-stores or because they were not edited effectively, e.g. a newspaper has several different editions in the same file. The result can sometimes be that you get a number of repeated concordance lines.

Solution
If you choose Edit | Remove Duplicates, Concord goes through your concordance lines and if it finds any two where the stored concordance lines are identical, regardless of the filename, date etc., it will mark one of these for deletion.
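The duplicate check can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the program's actual code: the function mark_duplicates, the dictionary keys "context" and "file" and the sample data are hypothetical, and the choice to mark the later of two identical lines (rather than the earlier) is an assumption. Nothing is deleted; lines are only marked.

```python
def mark_duplicates(lines):
    """Return the indices of lines whose stored context repeats an earlier one.

    lines: list of dicts with a "context" key (the characters saved with the
    concordance line) and a "file" key. The filename is deliberately ignored,
    so identical contexts from different files still count as duplicates.
    """
    first_seen = {}
    marked = []
    for index, line in enumerate(lines):
        context = line["context"]
        if context in first_seen:
            marked.append(index)       # mark for deletion; keep the first copy
        else:
            first_seen[context] = index
    return marked

# Invented data: the same stored context appears in two editions.
concordance = [
    {"context": "prices rose sharply last night, the minister said", "file": "edition1.txt"},
    {"context": "prices fell slightly last night, analysts said", "file": "edition1.txt"},
    {"context": "prices rose sharply last night, the minister said", "file": "edition2.txt"},
]
print(mark_duplicates(concordance))
```

Marking rather than deleting leaves room to inspect the matches before removing anything.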
The comparison covers all the "characters to save": two lines count as duplicates only if they are identical throughout. If you set this to 150 or so it is highly unlikely that false duplicates will be identified, since every single character, comma, space etc. would have to match.

Check before you zap...
At the end it will sort all the lines so you can see which ones match each other before you decide finally to zap the ones you really don't want.

7.14 patterns

When you have a collocation window open, one of the tab windows shows "Patterns". This will show the collocates (words adjacent to the search word), organised in terms of frequency within each column. That is, the top word in each column is the word most frequently found in that position. The second word is the second most frequent.

In R1 position (one word to the right of the search-word love) there seem to be both intimate (thee) and formal (you) pronouns associated with love in Shakespeare. And looking at L1 position it seems that speakers talk more of their love for another than of another's love for them.

The minimum frequency and length for one of the words to be shown at all is the minimum frequency/length for collocates.

The point of it...
The effect is to make the most frequent items in the neighbourhood of the search word "float up" to the top. Like collocation, this helps you to see lexical patterns in the concordance.

You can also highlight any given pattern collocate in your concordance display.

7.15 re-sorting

How to do it...
Sorting can be done simply by pressing the top row of any list. Or by pressing F6. Or by choosing the menu option.

The point of it...
The point of re-sorting is to find characteristic lexical patterns. It can be hard to see overall trends in your concordance lines, especially if there are lots of them.
By sorting them you can separate out multiple search words and examine the immediate context to left and right. For example you may find that most of the entries have "in the" or "in a" or "in my" just before the search word -- sorting by the second word to the left of the search word will make this much clearer.

Sorting is by a given number of words to the left or right (L1 [=1 word to the left of the search word], L2, L3, L4, L5, R1 [=1 to the right], R2, R3, R4, R5), on the search word itself, the context word (if one was specified), the nearest tag, the distance to the nearest tag, a set category of your own choice, or original file order (file).

Main Sort
The listing can be sorted by three criteria at once. A Main Sort on Left 1 (L1) will sort the entries according to the alphabetical order of the word immediately to the left of the search word. A second sort (Sort 2) on R2 would re-order the listing by tie-breaking, that is: only where the L1 words (immediately to the left of the search word) matched exactly, and would place these in alphabetical order of the words 2 to the right of the search word. For very large concordances you may find the third sort (Sort 3) useful: this is an extra tie-breaker in cases where the second sort matches.

For many purposes tie-breaking is unnecessary, and will be ignored if the "activated" box is not checked.

default sort
This is set in the main controller settings.

sorting by set (user-defined categories)
You can also sort by set, if you have chosen to classify the concordance lines according to your own scheme, using letters from A to Z or a to z or longer strings. The sort will put the classified lines first, in category order, followed by any unclassified lines. See Nearest Tag for details of sorting by tags. See colour categories for a more sophisticated way of using the Set column.

other sorts
As the screenshot below shows, you can also sort by a number of other criteria, most of these accessible simply by clicking on their column header.

The "contextual frequency" sort means sorting on the average ranking frequency of all the words in each concordance line which don't begin with a capital letter. For this you will be asked to specify your reference corpus wordlist. The result will be to sort those lines which contain "easy" (highly frequent) words at the top of the list.

All
By default you sort all the lines; you may however type in for example 5-49 to sort those lines only.

Ascending
If this box is checked, sort order is from A to Z, otherwise it's from Z to A.

See also: WordList sort, KeyWords sort, Choosing Language

7.15.1 re-sorting: dispersion plot

This automatically re-sorts the dispersion plot, rotating through these options:
alphabetically (by file-name)
in frequency order (in terms of hits per 1,000 words of running text)
by first occurrence in the source text(s): text order
by range: the gap between first and last occurrence in the source text.

See also: Dispersion Plot

7.16 saving and printing

You can save the concordance (and its collocates & other dependent results if these were stored when the concordance was generated) either as a Text File (e.g. for importing into a word processor) or as a file of results which you can subsequently Open (in the main menu at the top) to view again at a later date. When you leave Concord you'll be prompted to save if you haven't already done so. Saving a concordance allows you to return later and review collocates, dispersion plots, clusters.

You can Print using the Windows printer attached to your system.
You will get a chance to specify the number of pages to print. The font will approximate the one you can see on your screen. If you use a colour printer or one with various shades of grey, the screen colours will be copied to your printer. If it is a black-and-white printer, coloured items will come in italics if your printer can do italics.

Concord prints as much of your concordance plus associated details as your printing paper settings allow, the edges being shown in Print Preview.

If you choose to save as text, and if you have (optionally) marked out the search-word and/or context word in the Controller, whatever you have put will get inserted in the .txt file.

In the above example, doing a search through 23 Dickens texts for last night with drive as the context word, a concordance produced this in the txt file:

rry, tell him yourself to give him no restorative but air, and to remember my words of last night, and his promise of last night, and <CW>drive away!" The Spy withdrew, and Carton seated himself at the table, resting his forehead on his h

See also: using the clipboard to get a concordance into Word or another application.

7.17 sounds & video

The point of it
Suppose you do a concordance of "elephant" and want to hear how the word is actually spoken in context. Is the last vowel a schwa? Does the second vowel sound like "i" or "e" or "u" or a schwa?

How to do it...
If you have defined tags which refer to multimedia files, and if there are any such tags in the "tag-context" of a given concordance line, you can hear or see the source multimedia. The tag will be parsed to identify the file needed, if necessary downloading it from a web address, and then played.

In this screenshot we see a concordance where there is a tag inserted periodically in the text file.
To play the media file, choose File | Play media file, or double-click the Tag column.

Video files can be played if the free VLC Media Player is installed (see http://www.vlcapp.com/).

The next screenshot below shows a concordance line with, in the Nearest Tag column, the mark-up saying that the source text and the video file have the same file-name (except that the latter ends .AVI and the former .TXT).

A double-click on the Tag (yellow highlighted cell) brought up the video screen you can see below, and that has now played to the tenth second, then paused. You can see in the case of this particular video that there is a sub-title with the same words that are in the concordance above (though there is no guarantee you will see sub-titles for all videos).

If you build up a collection of TED talks like these where the same video in English has transcripts in several languages, you can get to see the different translations by choosing View | Show related txts in the menu.

See also: Multi-media Tag syntax, Handling Tags, Obtaining Sound and Video files, Making a Tag File, Types of Tag, Showing Nearest Tags in Concord, Tag Concordancing, Tags in WordList, Viewing the Tags, Using Tags as Text Selectors

7.17.1 obtaining sound and video files

Sources of sound and video files
WordSmith does not provide or include corpora. However, there are specialised corpora such as ICE, MICASE and NECTE, and then there are publicly available sources such as the TED Talks. You are expected to respect copyright provisions in all cases.

There is a lot of useful advice at TED Open Translation Project where you will find transcripts.

These text files in English (.en), Spanish (.es), Italian (.it) and Japanese (.ja) were downloaded from there and later converted using the Text Converter.

If you wish to use a transcript and sound file format which is incompatible with the syntax described here, please contact us.

7.18 summary statistics

The idea is to be able to break down your concordance data. For example, you've just done a concordance of consequence? which has given you lots of singulars and lots of plurals and you want to know how many there are of each.

Choose Summary Statistics in the Compute menu.

The searches window will at first contain a copy of what you typed in when you created the concordance. To distinguish between singular and plural, change that to the two forms, consequence and consequences, and press Count; assuming that search column has Concordance selected, you will get something like this:

[screenshot]

Advanced Summary Statistics features

Breakdown
The idea here is to be able to break down your results further, using another category in your existing concordance data, such as the files the data came from. In our example, we might want to know for consequence and consequences, how many of the text files contained each of the two forms. To generate the breakdown, activate it and choose the category you need.

The results window will now show something like this:

[screenshot]

where it is clear that the singular consequence came 20 times in 20 different files, the first being file A3A.TXT. Further down you will find the results for consequences, which appeared 103 times in 74 files, and that in the first of these, A1E.TXT, it came twice.
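The counting and the breakdown by file can be illustrated with a short Python sketch. It is an illustration only, not WordSmith's code: the function summary_statistics, the (word, filename) input format and the sample data, including the file names, are hypothetical.

```python
from collections import Counter, defaultdict

def summary_statistics(lines, searches):
    """Count each search form and break the counts down by source file.

    lines: list of (matched_word, filename) pairs, one per concordance line.
    searches: the forms to count, e.g. the singular and the plural.
    Matching is case-insensitive, like search words by default.
    """
    wanted = {s.lower() for s in searches}
    per_file = defaultdict(Counter)
    for word, filename in lines:
        form = word.lower()
        if form in wanted:
            per_file[form][filename] += 1
    return {
        form: {
            "hits": sum(per_file[form].values()),
            "files": len(per_file[form]),
            "breakdown": dict(per_file[form]),
        }
        for form in wanted
    }

# Invented concordance lines: (word found, file it came from).
lines = [
    ("consequence", "file1.txt"),
    ("consequences", "file2.txt"),
    ("consequences", "file2.txt"),
    ("Consequences", "file3.txt"),
]
stats = summary_statistics(lines, ["consequence", "consequences"])
for form in sorted(stats):
    print(form, stats[form])
```

Here "hits" corresponds to the count for each form and "files" to the number of different files it came from, with the per-file figures under "breakdown".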
Cumulative column: see the explanation for WordList.
Load button: see the explanation for count data frequencies.
7.19 text segments in Concord
A concordance line brings with it information about which segment of the text it was found in. In the screenshot below, a concordance on year was carried out; the listing has been sorted by Heading Position: in the top 2 lines, year is found as the 3rd word of a heading. The advantage of this is that it is possible to identify search-words occurring near sentence starts, or near the beginning of sections, headings or paragraphs.</p> <p><span class="badge badge-info text-white mr-2">234</span>
You can toggle the numbers between raw numbers and percentages.
See also: Start and end of text segments.</p> <p><span class="badge badge-info text-white mr-2">235</span>
7.20 viewing options
Access these options in the main Controller, via Concord | What you see.
Sort preferences
By default, Concord will sort a new concordance by the word to the left (L1), but you can set this to different values if you like. For further details, see Sorting a Concordance.
Show collocate zero frequencies
This toggles whether 0 or a blank (the default) is shown if a collocate frequency is zero.</p> <p><span class="badge badge-info text-white mr-2">236</span>
Concordance View
You can choose different ways of seeing the data, and a whole set of choices as to what columns you want to display for each new concordance. You can re-instate any later if you wish by changing the Layout.
show full filename and path = sometimes you need to see the whole path but usually the filename alone will suffice.
cut redundant spaces = remove any double spaces
show sentence only = show the context only up to its left and right sentence boundaries
tag string only = show only context within two tag_string_only tags
show raw numbers = show the raw data instead of percentages, e.g. for sentence position
hide search-word = blank it out, e.g. to make a guess-the-word exercise
pad search-word with spaces = insert a space to left and right of the search-word so it stands out better
hide undefined tags = hide those not defined in your tag file
hide tag file tags = hide all tags including undefined ones
hide words = show only the tags
Some of the options are visible here:</p> <p><span class="badge badge-info text-white mr-2">237</span> for example, the sub-set visible shows an opportunity to blank out the search-word, to pad it with a space left and right, and to shift the search-word left or right.
See also: Controller What you get choices, showing nearest tags, blanking out the search-word, viewing more context, growing/shrinking concordance lines.
7.21 WordSmith controller: Concord: settings
These are found in the main Controller marked Concord. This is because some of the choices (e.g. collocation horizons) may affect other Tools.</p> <p><span class="badge badge-info text-white mr-2">238</span>
When you have computed a concordance, the Concord button will have a red number (showing how many Concord windows are in use) and at the bottom of the screen you will see an icon. Click that to see the list of files and their features.
WHAT YOU GET and WHAT YOU SEE
There are 2 tabs for settings, affecting What you get in the concordance and What you see in the display. There is a screenshot at Concord: viewing options showing the options under What you see.
WHAT YOU GET
Search Settings
The search settings button lets you choose these settings:</p> <p><span class="badge badge-info text-white mr-2">239</span>
Entries Wanted
The maximum is more than 2 billion lines. This feature is useful if you're doing a number of searches and want, say, 100 examples of each. The 100 entries will be the first 100 found in the texts you have selected. If you search for more than 1 search-word (e.g. book/paperback), you will get 100 of book and 100 of paperback.
entries near each other: allows you to force Concord to skip hits which are too close to each other. If, for example, you set this to 0 or 1 and your text contains ... a lovely lovely day, then you will only get the first of these cases if searching for lovely. The default here is 1. (If you set it to 0 then you are only allowing one hit within any given word.)
randomised entries: this feature allows you to randomise the search. Here Concord goes through the text files and gets the 100 entries by giving each hit a random chance of being selected. To get 100 entries Concord will have to have found around 450-550 hits with the settings shown below. You can set the randomiser anywhere from 1 in 2 to 1 in 1,000.
See also: reduce to N.</p> <p><span class="badge badge-info text-white mr-2">240</span>
auto remove duplicates: removes any lines where the whole concordance entry matches another. (This can happen if you have a corpus where news stories get re-published in different editions by different newspapers.)
Ignore punctuation between words: this allows a search for BY ITSELF to succeed where the text contains ... went by, itself
Characters to save
Here is where you set how many characters in a concordance line will be stored as text as the concordance is generated. The default and minimum is 1000.
This number of characters will be saved when you save your results, so even if you subsequently delete the source text file you can still see some context. If you grow the lines, more text will be read in (and stored) as needed. There are examples here.</p> <p><span class="badge badge-info text-white mr-2">241</span>
Save as text
search-word or context-word marker: here you can also specify markers for your search-word and context-word.
Collocates
By default, Concord will compute collocates as well as the concordance, but you can set it not to if you like (Minimal processing). For further details, see Collocate Horizons or Collocation.
The minimum frequency and length refer to the collocates to be shown in your listing. With the settings above, only collocates which occur at least 5 times and contain at least 1 character will be shown, as long as they don't cross sentence boundaries.
If separate search words is checked and you have multiple search-terms, then you get collocates distinguishing between the different search-terms. If you want them amalgamated, clear this check-box.
Collocates relation statistic
Choose between Specific Mutual Information, MI3, Z Score, Log Likelihood. See Mutual Information Display for examples of how these can differ.
WHAT YOU SEE</p> <p><span class="badge badge-info text-white mr-2">242</span>
The options are explained at Concord: viewing options.
Columns
The Columns to show/hide list offers all the standard columns: you may uncheck ones you normally do not wish to see. This will only affect newly computed data: earlier data uses the column visibility, size, colours etc. already saved. They can be altered using the Layout menu option at any time.
See also: Concord Help Contents, Collocation Settings, Concord Saving and Printing.
</p> <p><span class="badge badge-info text-white mr-2">243</span> KeyWords Section VIII</p> <p><span class="badge badge-info text-white mr-2">244</span>
8 KeyWords
8.1 purpose
This is a program for identifying the "key" words in one or more texts. Key words are those whose frequency is unusually high in comparison with some norm. See the example of key words.
The point of it...
Key-words provide a useful way to characterise a text or a genre. Potential applications include: language teaching, forensic linguistics, stylistics, content analysis, text retrieval.
The program compares two pre-existing word-lists, which must have been created using the WordList tool. One of these is assumed to be a large word-list which will act as a reference file. The other is the word-list based on one text which you want to study.
The aim is to find out which words characterise the text you're most interested in, which is automatically assumed to be the smaller of the two texts chosen. The larger will provide background data for reference comparison.
Key-words and links between them can be plotted, made into a database, and grouped according to their associates.
Online step-by-step guide showing how
8.2 index
Explanations
What is the Keywords program and what's it for?
How Key Words are Calculated
2-Word list Analysis
Key words display
Key words plot
Key words plot display
Plot-Links
Batch Analyses
Database of Key Key-Words
Associates
Clumps
Limitations
Settings and Procedures
Calling up a Concordance
</p> <p><span class="badge badge-info text-white mr-2">245</span>
Choose Word Lists
Colours
Database
Folders
Fonts
Keyboard Shortcuts
Printing
Re-sorting
Exiting
Tips
KeyWords advice
Window management
Definitions
General Definitions
Key-ness
Key key-word
Associate
See also: WordSmith Main Index
8.3 ordinary two word-list analysis
The usual kind of KeyWords analysis. It compares the one text file (or corpus) you're chiefly interested in with a reference corpus based on a lot of text. In the screenshot below we are interested in the key words of deer hunter story and we're using BNC as the reference corpus to compare with.
Choose Word Lists
In the dialogue box you will choose 2 files: the text file in the box above and the reference corpus file in the box below.</p> <p><span class="badge badge-info text-white mr-2">246</span>
See also: How Key Words are Calculated, KeyWords Settings</p> <p><span class="badge badge-info text-white mr-2">247</span>
8.4 choosing files
Current Text word list
In the upper box, choose a word list file. To choose more than 1 word list file, press Control as you click to select non-adjacent lists, or Shift to select a range. This box determines which word-list(s) you're going to find the key words of.
Reference Corpus word list
In the box below, you choose your Reference Corpus List. (This can be set permanently in the main Controller Settings.)
No word lists visible
If you can't see any word lists in the displays, either change folders until you can, or go back to the WordList tool and make up at least 2 word lists: this procedure requires at least two before it can make a comparison.
Swap
The text you're studying must be at the top. If you get them wrong, exchange them.
</p> <p><span class="badge badge-info text-white mr-2">248</span>
Advanced: working with a batch file
Click the browse button and choose the batch .zip file, and we are ready to make a batch: that one, 2010.zip, contains many thousands of word lists.</p> <p><span class="badge badge-info text-white mr-2">249</span>
8.5 concordance
With a key word list or a word list on your screen, you can choose Compute | Concordance to call up a concordance of the currently selected word(s). The concordance will search for the same word in the original text file that your key word list came from.
The point of it...
is to see these same words in their original contexts.</p> <p><span class="badge badge-info text-white mr-2">250</span>
8.6 example of key words
You have a collection of assorted newspaper articles. You make a word list based on these articles, and see that the most frequent word is the. Among the rather infrequent words in the list come examples like hopping, modem, squatter, grateful, etc.
You then take from it a 1,000 word article and make a word list of that. Again, you notice that the most frequent word is the. So far, not much difference.
You then get KeyWords to analyse the two word lists. KeyWords reports that the most "key" words are: squatter, police, breakage, council, sued, Timson, resisted, community.
These "key" words are not the most frequent words (which are those like the) but the words which are most unusually frequent in the 1,000 word article.
Key words usually give a reasonably good clue to what the text is about. Here is an example from the play Othello.
See also: word-lists with tags as prefix.
8.7 keyness
8.7.1 p value
(Default=0.000001)
The p value is that used in standard chi-square and other statistical tests. This value ranges from 0 to 1.
A value of .01 suggests a 1% danger of being wrong in claiming a relationship; .05 would give a 5% danger of error. In the social sciences a 5% risk is usually considered acceptable.
In the case of key word analyses, where the notion of risk is less important than that of selectivity, you may often wish to set a comparatively low p value threshold such as 0.000001 (one in 1 million, or 1E-6 in scientific notation) so as to obtain fewer key words. Or you can set a low "maximum</p> <p><span class="badge badge-info text-white mr-2">251</span> wanted" number in the main Controller, under KeyWords Settings.
If the chi-square procedure is used, the computed p value will only be shown if all appropriate statistical requirements are met (all expected values >= 5).
See also: choosing a reference corpus, Definitions
8.7.2 key-ness definition
The term "key word", though it is in common use, is not defined in Linguistics. This program identifies key words on a mechanical basis by comparing patterns of frequency. (A human being, on the other hand, may choose a phrase or a superordinate as a key word.)
A word is said to be "key" if
a) it occurs in the text at least as many times as the user has specified as a Minimum Frequency
b) its frequency in the text when compared with its frequency in a reference corpus is such that the statistical probability as computed by an appropriate procedure is smaller than or equal to a p value specified by the user.
positive and negative keyness
A word which is positively key occurs more often than would be expected by chance in comparison with the reference corpus.
A word which is negatively key occurs less often than would be expected by chance in comparison with the reference corpus.
typical key words
KeyWords will usually throw up 3 kinds of words as "key".
First, there will be proper nouns.
Proper nouns are often key in texts, though a text about racing could wrongly identify as key the names of horses which are quite incidental to the story. This can be avoided by specifying a higher Minimum Frequency.
Second, there are key words that human beings would recognise. The program is quite good at finding these, and they give a good indication of the text's "aboutness". (All the same, the program does not group synonyms, and a word which only occurs once in a text may sometimes be "key" for a human being. And KeyWords will not identify key phrases unless you are comparing word-lists based on word clusters.)
Third, there are high-frequency words like because or shall or already. These would not usually be identified by the reader as key. They may be key indicators more of style than of "aboutness". But the fact that KeyWords identifies such words should prompt you to go back to the text, perhaps with Concord (just choose Compute | Concordance), to investigate why such words have cropped up with unusual frequencies.
See also: How Key Words are Calculated, Definitions, Definition of Key Key-Word, KeyWords Settings</p> <p><span class="badge badge-info text-white mr-2">252</span>
8.7.3 thinking about keyness
Choosing a reference corpus
In general the choice does not make a lot of difference if you have a fairly small p value (such as 0.000001). But different reference corpora may give different results, and it may help to think using this analogy.
Suppose you have a method for comparing objects and you take a particular apple out of your kitchen to compare using it
A) with a lot of apples in the green-grocer's shop
B) with all the fruit in the green-grocer's shop
C) with a mixture of objects (cars, carpet, notebooks, fruit, elephants etc.)
With A) you will get to see the individual characteristics, e.g. perhaps your apple is rather sweeter than most apples.
(But you won't see its "apple-ness" because your apple and all the others in your reference corpus are apples.)
With B) you will see "appleness" (your apple, like all apples but unlike bananas or pineapples, is rather round and has a very thin skin) but might not see that your apple is rather sweet, and you won't get at its "fruitiness".
With C) you will get at the apple's fruity qualities: it is much sweeter and easier to bite into than cars and notebooks etc.
Keyness scores
Is there an important difference between a key word with a keyness of 50 and another of 500?
Suppose you process a text about a farmer growing 3 crops (wheat, oats and chick-peas) and suffering from 3 problems (rain, wind, drought). If each of these crops is equally important in the text, and each of the 3 problems takes one paragraph to explain, the human reader may decide that all three crops are equally key and all three problems equally key. But in English these three crop-terms and weather-terms vary enormously in frequency (chick-peas and drought least frequent). WordSmith's KW analysis will necessarily give a higher keyness value to the rarer words. So it is generally unsafe to rely on the order of KWs in a KW list.
8.8 KeyWords database
(default file extension .KDB)
The point of it...
The point of this database is that it will allow you to study the key-words which recur often over a number of files. For example, if you have 500 business reports, each one will have its own key words. These will probably be of two main kinds. There will be key-words which are key in one text but are not</p> <p><span class="badge badge-info text-white mr-2">253</span> generally key (names of the firms and words relating to what they individually produce); and other, more general words (like consultant, profit, employee) which are typical of business documentation generally. Or you may find that I, you, should etc. come to the top if your text
Or you may find that files are ones which are much more interactive than the reference corpus texts. By making up a database, you can sort these out. The ones at the top of the list, when you view them, may be those which are most typical of the genre in some way. We might call the ones at the top "key-key words" and the list is at first ordered in terms of "key key-ness", but those at the bottom will only be key in a few text files. You can of course toggle it into alphabetical order and back again. You can set a minimum number of files that each word must have been found to be key in, using 238 KeyWords Settings | Database . 241 When viewing a database you will be able to investigate the associates of the key key-words. Under Statistics, you will also be able to see details of the key words files which comprise the database (file name and number of key words per file), together with overall statistics on the number of different types and the tokens (the total of all the key-words in the whole database including repeats). 240 238 , Definition of key key-word See also : Creating a database 8.8.1 creating a database To build a key words database, you will need a set of key word lists. For a decent sized database, it is preferable to build it like this: 39 1. Make a batch of word lists. 39 of keyword lists. Set "faster minimal processing" on as in this shot, 2. Use this to make a batch so as to not waste time computing plots etc. December, 2015. Page 238</p> <p><span class="badge badge-info text-white mr-2">254</span> 239 WordSmith Tools Manual 3. Now, in KeyWords , choose New | KW Database . This enables you to choose the whole set of key word files. 245 Note that making a database means that only positive key words will be retained. 254 In the Controller KeyWords settings you can make other choices: minimum frequency for database If you set this to 5 you will only use for the database any KWs which appear in 5 or more texts min. 
KWs per text
If this is set to 10, any KW results files which ended up with very few positive KWs will be ignored.</p> <p><span class="badge badge-info text-white mr-2">255</span>
See also: associates.
8.8.2 key key-word definition
A "key key-word" is one which is "key" in more than one of a number of related texts. The more texts it is "key" in, the more "key key" it is. This will depend a lot on the topic homogeneity of the corpus being investigated. In a corpus of City news texts, items like bank, profit, companies are key key-words, while computer will not be, though computer might be a key word in a few City news stories about IBM or Microsoft share dealings.
Requirements
To discover "key key words" you need a lot of text files (say 500 or more), ideally fairly related in their topics, which you make word-lists of (it's much faster doing that in a batch), and then you have to compute key word-lists of each of those, all of which go into a database. It is all explained under creating a keywords database.
See also: How Key Words are Calculated, Definition of Key Word, Creating a Database, Definitions</p> <p><span class="badge badge-info text-white mr-2">256</span>
8.8.3 associates
"Associates" is the name given to key-words associated with a key key-word.
The point of it...
The idea is to identify words which are commonly associated with a key key-word, because they are key words in the same texts as the key key-word is.
An example will help. Suppose the word wine is a key key-word in a set of texts, such as the weekend sections of newspaper articles. Some of these articles discuss different wines and their flavours, others concern cooking and refer to using wine in stews or sauces, others discuss the prices of wine in a context of agriculture and diseases affecting vineyards.
In this case, the associates of wine would be items like Chardonnay, Chile, sauce, fruit, infected, soil, etc.
The listing shows associates in order of frequency. A menu option allows you to re-sort them.
Settings
You can set a minimum number of text files for the association procedure, in the database settings:
Minimum texts
These screenshot settings would only process those key-key-words which appear in at least 3 text</p> <p><span class="badge badge-info text-white mr-2">257</span> files.
Statistic
Choose the mutual information statistic you prefer, apart from Z score which uses a span (here we're using the whole text).
Minimum strength
This will only show associates which reach at least the strength in the statistic set here, e.g. 3.000.
This screenshot shows the most frequent associates in the right-hand column of the main keywords database window. To see the detailed associates, double-click your chosen term in the KW column or the Associates column:
See also: definition of associate, related clusters.</p> <p><span class="badge badge-info text-white mr-2">258</span>
associate definition
An "associate" of key-word X is another key-word (Y) which co-occurs with X in a number of texts. It may or may not co-occur in proximity to key-word X. (A collocate would have to occur within a given distance of it, whereas an associate is "associated" by being key in the same text.)
For example, in a key-word database of Guardian newspaper text, wine was found to be a key word in 25 out of 299 stories from the Saturday "tabloid" page, thus a key key word in this section. The top associates of wine were: wines, Tim, Atkin, dry, le, bottle, de, fruit, region, chardonnay, red, producers, beaujolais.
It is strikingly close to the early notion of "collocate".
Association operates in various ways. It can be strong or weak, and it can be one-way or two-way.
For example, the association between to and fro is one-way (to is nearly always found near fro but it is rare to find fro near to).
See also: Definition of Key Word, Associates, Definitions, Mutual Information
8.8.4 keywords database related clusters
The idea is to be able to find any overlapping clusters in a key word database, e.g. where MY LORD is related to MY LORD YOUR SON.
To achieve this, choose Compute | Associates. To clear the view, choose Compute | Clusters.
See also: associates
8.8.5 clumps
"Clumps" is the name given to groups of key-words associated with a key key-word.
The point of it (1)...
The idea here is to refine associates by grouping together words which are found as key in the same sub-sets of text files. The example used to explain associates will help.
Suppose the word wine is a key key-word in a set of texts, such as the weekend sections of newspaper articles. Some of these articles discuss different wines and their flavours, others concern cooking and refer to using wine in stews or sauces, others discuss the prices of wine in a context of agriculture and diseases affecting vineyards. In this case, the associates of wine would be items like Chardonnay, Chile, sauce, fruit, infected, soil, etc.
The associates procedure shows all such items unsorted.
The clumping procedure, on the other hand, attempts to sort them out according to these different</p> <p><span class="badge badge-info text-white mr-2">259</span> uses. The reasoning is that the key words of each text file give a condensed picture of its "aboutness", and that "aboutnesses" of different texts can be grouped by matching the key word lists. Thus sets of key words can be clumped together according to the degree of overlap in the key word lexis of each text file.
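As a rough illustration of "degree of overlap" between the key word sets of two text files, here is a minimal Python sketch. The manual does not state which overlap measure KeyWords uses; the shared-over-combined ratio below (the Jaccard index) is an assumption chosen only for illustration, and the function name `overlap` and the sample key words are hypothetical.

```python
def overlap(keys_a, keys_b):
    """Compare the key word sets of two text files.

    Returns the words found only in the first set, the shared words,
    the words found only in the second set, and a simple overlap score.
    The score (shared / combined) is an assumed measure, not necessarily
    the one WordSmith applies.
    """
    a, b = set(keys_a), set(keys_b)
    shared = a & b
    combined = a | b
    return {
        "only_a": sorted(a - b),
        "shared": sorted(shared),
        "only_b": sorted(b - a),
        "score": len(shared) / len(combined) if combined else 0.0,
    }

# Invented key word sets for two hypothetical text files.
tasting = ["wine", "chardonnay", "fruit", "dry"]
cooking = ["wine", "sauce", "fruit", "stew"]
result = overlap(tasting, cooking)
```

Two files whose key word sets score highly on a measure of this kind would be candidates for the same clump; files that share only the key key-word itself would score low.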
Two stages
The initial clumping process does no grouping: you will simply see each set of key-words for each text file separately. To group clumps, you may simply join those you think belong together (by dragging), or regroup with help by pressing the corresponding toolbar button.
The listing shows clumps sorted in alphabetical order. You can re-sort by frequency (the number of times each key word in the clump appeared in all the files which comprise the clump).
See also: regrouping clumps, definition of associate
regrouping clumps
How to do it
You can simply join by dragging, where you think any two clumps belong together because of semantic similarity between their key-words.
Or if you press the regroup button, KeyWords will inform you which two clumps match best. You'll see a list of the words found only in one, a list of the words found only in the other, and (in the middle) a list of the words which match. It's up to you to judge whether the match is good enough to form a merged clump.
If you aren't sure, press Cancel.
If you do want to join them, press Join.
If you're sure you don't want to join them and don't want KeyWords to suggest this pair again, press Skip. You can tell KeyWords to skip up to 50 pairs. To clear the memory of the items to be skipped, press Clear Skip.
The point of it (2)...
Scott (1997) shows how clumping reveals the different perceived roles of women in a set of Guardian features articles.
See also: clumps
8.9 KeyWords: advice
1. Don't call up a plot of the key words based on more than one text file. It doesn't make sense! Anyway the plot will only show the words in the first text file. If you want to see a plot of a certain word or phrase in various different files, use Concord dispersion.
2. There can be no guarantee that the "key" words are "key" in the sense which you may attach to "key". An "important" word might occur once only in a text.
They are merely the words which are outstandingly frequent or infrequent in comparison with the reference corpus.
3. Compare apples with pears, or, better still, Coxes with Granny Smiths. So choose your reference corpus in some principled way. The computer is not intelligent and will try to do whatever comparisons you ask it to, so it's up to you to use human intelligence and avoid comparing apples with phone boxes!</p> <p><span class="badge badge-info text-white mr-2">260</span>
If it didn't work...
For the procedure to work, a number of conditions must be right: the language defined for each word list must be the same (that is, Mexican Spanish and Iberian Spanish count as the same but Iberian Spanish and Brazilian Portuguese count as different so could not be compared in this process); each word list must have been sorted alphabetically in ascending order before the comparison is made. (The program tries to ensure this automatically.) Also, any prefixes or suffixes must match.
8.10 KeyWords: calculation
The "key words" are calculated by comparing the frequency of each word in the word-list of the text you're interested in with the frequency of the same word in the reference word-list. All words which appear in the smaller list are considered, unless they are in a stop list.
If the occurs, say, 5% of the time in the small word-list and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. may be more frequent than they would otherwise be in your reference corpus (unless your reference corpus only concerns spiders!)
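To make the comparison concrete, here is a small Python sketch that scores one word from four numbers: its frequency in the study word-list, the running words in that list, its frequency in the reference word-list, and the running words there. It uses the widely published two-term form of the log likelihood statistic; whether WordSmith computes exactly this variant is an assumption, and the function names and sample counts are invented. Zero frequencies are simply skipped in the sum, which is a simplification made for this sketch.

```python
import math

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Two-term log likelihood for one word across two word-lists."""
    combined = freq_text + freq_ref
    total = size_text + size_ref
    # Expected frequencies if the word were equally common in both lists.
    expected_text = size_text * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_text, expected_text), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def direction(freq_text, size_text, freq_ref, size_ref):
    """Say whether the word is relatively more or less frequent in the study text."""
    rel_text = freq_text / size_text
    rel_ref = freq_ref / size_ref
    if rel_text > rel_ref:
        return "positive"
    if rel_text < rel_ref:
        return "negative"
    return "neutral"

# Invented counts: a word at 5% of a 1,000-word text versus 6% of a
# 1,000,000-word reference corpus is relatively less frequent in the text.
the_direction = direction(50, 1000, 60000, 1000000)
```

A word with identical relative frequency in both lists scores 0, and the score grows as the two relative frequencies diverge; the direction only tells you on which side the difference lies.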
To compute the "key-ness" of an item, the program therefore computes
its frequency in the small word-list
the number of running words in the small word-list
its frequency in the reference corpus
the number of running words in the reference corpus
and cross-tabulates these.
Statistical tests include:
the classic chi-square test of significance with Yates correction for a 2 X 2 table
Ted Dunning's Log Likelihood test, which gives a better estimate of keyness, especially when contrasting long texts or a whole genre against your reference corpus.
See UCREL's log likelihood site for more on these.
A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger word-list.
Unusually infrequent key-words are called "negative key-words" and appear at the very end of your listing, in a different colour. Note that negative key-words will be omitted automatically from a keywords database and a plot.
Words which do not occur at all in the reference corpus are treated as if they occurred 5.0e-324</p> <p><span class="badge badge-info text-white mr-2">261</span> times (0.0000000 and loads more zeroes before a 5) in such a case. This number is so small as not to affect the calculation materially while not crashing the computer's processor.
8.11 KeyWords clusters
What is it?
A KeyWords cluster, like a WordList cluster, represents two or more words which are found repeatedly near each other. However, a KeyWords cluster only uses key words.
A screenshot will help make things clearer. This is a key words list based on a piece of transcript from a Wallace and Gromit film, using the BNC as the reference corpus.
The clusters tab below shows us something like this:
</p> <p><span class="badge badge-info text-white mr-2">262</span>
The frequency 3 in the line GROMIT OH means that there are 3 cases where the key-word GROMIT is found within the current collocation span of OH in that text. [.] means that there is typically one intervening word, or [..] two intervening words, as in this case shown from the source text.
Requirements
The procedure is text-oriented. You can only get a keywords cluster list if there is exactly one source text. Note that for this procedure sentence boundaries are not blocked, so in Gromit Ah Oh, Gromit and Oh can be considered to have one word intervening.
See also: Plot calculation.
8.12 KeyWords: links
The point of it...
is to find out which key-words are most closely related to a given key-word.
A plot will show where each key word occurs in the original file. It also shows how many links there are between key-words.
What are links?</p> <p><span class="badge badge-info text-white mr-2">263</span>
Links are "co-occurrences of key-words within a collocational span". An example is much easier to understand, though:
Suppose the word elephant is key in a text about Africa, and that water is also a key word in the same text. If elephant and water occur within a span of 5 words of each other, they are said to be "linked". The number of times they are linked like this in the text will be shown in the Links window.
The link spans (like collocation horizons) go from 1 word away to up to 25 words to left and right. The default is 1 to 5.
What you see
This is a key words list based on Romeo and Juliet, using all the 37 Shakespeare plays as the reference corpus. This Links window shows a number of key words followed by the number of linked types (11 here), the total number of hits of the key word (ROMEO) and then the individual linked key words.
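The idea of a link can be sketched in a few lines. The function below counts, for each pair of different key words, how many times they occur within a given span of each other in a tokenised text. The function name, the simple whitespace tokenisation and the sample sentence are assumptions made for illustration; WordSmith's own counting may differ in detail.

```python
from collections import Counter

def count_links(tokens, key_words, span=5):
    """Count co-occurrences of different key words within `span` words.

    Hypothetical helper: returns a Counter keyed by alphabetically
    ordered word pairs.
    """
    keys = {k.lower() for k in key_words}
    # Keep only the positions where a key word occurs.
    hits = [(i, t.lower()) for i, t in enumerate(tokens) if t.lower() in keys]
    links = Counter()
    for a in range(len(hits)):
        pos_a, word_a = hits[a]
        for b in range(a + 1, len(hits)):
            pos_b, word_b = hits[b]
            if pos_b - pos_a > span:
                break                   # later hits are even further away
            if word_a != word_b:
                links[tuple(sorted((word_a, word_b)))] += 1
    return links

text = ("the elephant walked to the water while another elephant "
        "slept far away from any river or water")
links = count_links(text.split(), ["elephant", "water"], span=5)
```

Here elephant and water fall within 5 words of each other twice, so the pair is linked twice; the final water is too far from any elephant to count.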
You can if you wish double-click in the Linked KWs column and you will see the details listed:</p> <p><span class="badge badge-info text-white mr-2">264</span>
ROMEO has 11 linked words; it's linked 23 times with THOU, 15 times with O, etc. A right-click menu lets you copy or print these details.
Requirements
The procedure is text-oriented. You can only get a keywords links list if there is exactly one source text.
Double-click on any word in the plot listing to call up a window which shows the linked key-words.
See also: Plot calculation, Source Text, KeyWords clusters
8.13 make a word list from keywords data
With a key word list on your screen, you can press ( ) to save your data as a word list (for later comparison, etc. using WordList functions).
8.14 plot display
The plot will give you useful visual insights into how often and where the different key words crop up in the text. The plot is initially sorted to show which crop up more at the beginning (e.g. in the introduction) and then those from further in the text.
The following screenshot shows KWs of the play Romeo and Juliet, revealing where each term occurs. The name Tybalt, for example, occurs in a main burst about half way through the text.</p> <p><span class="badge badge-info text-white mr-2">265</span>
re-sorting
Click the header to re-sort the listing or use the menu option. The Key word column sorts alphabetically, the dispersion column sorts on the amount of dispersion (higher numbers mean the occurrences are more spread out); the keyness column is the original plot order, or you can sort on number of links with other KWs or on the number of hits found.
plot data
You can view the plot data as numbers by double-clicking. Here is the view if one double-clicks on the yellow area:
The first column gives the word-numbers and the second the percentage of the way through the text.
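The two columns of plot data can be approximated as follows. This is only a sketch: the function name is hypothetical, word-numbers are assumed to start at 1, and the percentage is assumed to be the word-number divided by the total number of running words.

```python
def plot_positions(tokens, word):
    """Return (word_number, percent_through_text) for each hit of `word`.

    Hypothetical helper; word numbers are 1-based.
    """
    total = len(tokens)
    target = word.lower()
    rows = []
    for number, token in enumerate(tokens, start=1):
        if token.lower() == target:
            rows.append((number, round(100 * number / total, 2)))
    return rows

sample = "a Tybalt b c Tybalt d e f g h".split()
rows = plot_positions(sample, "tybalt")
```

In this ten-word sample the two hits come out as word-numbers 2 and 5, i.e. 20% and 50% of the way through the text.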
Right-click on this window to copy or print.</p> <p><span class="badge badge-info text-white mr-2">266</span>
links
This shows the total number of links between the key-word and other key-words in the same text, within the current collocation span (default = 5,5). That is, how many times each key-word was found within 5 words to the left or right of any of the other key-words in your plot.
hits
This column is here to remind you of how many occurrences there were of each key-word.
When you have obtained a plot, you can then see the way certain words relate to others. To do this, look at the Links window in the tabs at the bottom, showing which other key words are most linked to the word you clicked on. That is, which other words occur most often within the collocation horizons you've set. The Links window should help you gain insights into the lexical relations here.
Each plot window is dependent on the key words listing from which it was derived. If you close that down, it will disappear. You can Print it. There's no Save option because the plot comes from a key words listing which you should Save, or Save As. There's no save as text option because the plot has graphics, which cannot adequately be represented as text symbols, but you can Copy to the clipboard (Ctrl+C) and then paste it into a word processor as a graphic. Alternatively, use the Output | Data as Text File option, which saves your plot data (each word is followed by the total number of words in the file, then the word number position of each occurrence).
The ruler in the menu ( ) allows you to see the plot divided into 8 equal segments if based on one text, or the text-file divisions if there is more than one.
See also: Key words plot, plot dispersion value
8.14.1 plot calculation
The point of it...
is to see where the key words are distributed within the text.
Do they cluster around the middle or near the beginning of the text?
How it's done
This will calculate the inter-relationships between all the key words identified so far, excluding any which you have deleted or zapped.
1. it does a concordance on the text finding all occurrences of each key word;
2. it then works out which of each of the other key words appear within the collocation horizons (set in Settings). It uses the larger of the two horizons.
3. it then plots all the words showing where each occurrence comes in the original file (with a "ruler" showing how many words there are in each part of the file).
4. it computes how many other key-words co-occurred with it, within the current collocational span.
5. it computes a plot dispersion value.
Note: this process depends on KeyWords being able to find the source texts which your original word-list was based on.
You may find it useful to export your plot and make other graphs, as explained under Save As.</p> <p><span class="badge badge-info text-white mr-2">267</span>
See also: Plot Links, Key words plot display
8.15 re-sorting: KeyWords
How to do it...
Sorting can be done simply by pressing the top row of any list. Or by pressing F6 or Ctrl+F6. Or by choosing the menu option. Press again to toggle between ascending & descending sorts.
the different sorts
A key words list offers a choice between sorting by
· key-ness (the keyest words appear at the top)
· alphabetical order (from A to Z)
· frequency in the smaller list (the most frequent words come first)
· frequency in the reference list (the most frequent words come first)
A key words plot rotates between sorting by
· key-ness (the keyest words appear at the top)
· alphabetical order (from A to Z)
· frequency (words which appear oftenest come first)
· number of links (the most linked words come first)
· first mention of each key word in the text
· range (words used in smallest sections of text come first)
A key key words database toggles between sorting by
· frequency (the most key key words appear at the top)
· alphabetical order (from A to Z)
An Associates list toggles between sorting by
· frequency (association between title-word and item)
· alphabetical order (from A to Z)
· frequency (association between item and title-word)
8.16 the key words screen
The display shows
· each key word
· its frequency in the source text(s) which these key words are key in (Freq. column below)
· the % that frequency represents
· the number of texts it was present in
· its frequency in the reference corpus (RC. Freq. column)
· the reference corpus frequency as a %</p> <p><span class="badge badge-info text-white mr-2">268</span>
· keyness (chi-square or log likelihood statistic)
· p value
· lemmas (any which have been joined to each other)
· the user-defined set
The calculation of how unusual the frequency is, is based on the statistical procedure used. The statistic appears to the right of the display. If the procedure is log likelihood, or if chi-square is used and the usual conditions for chi-square obtain (expected value >= 5 in all four cells), the probability (p) will be displayed to the right of the chi-square value.
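The chi-square condition mentioned above can be made concrete with a small sketch. It builds the 2 x 2 table for one word, applies the Yates continuity correction, checks whether all four expected values are at least 5, and converts a statistic with one degree of freedom into a p value. The function names and example counts are hypothetical; the formulas are the standard textbook ones, not code taken from WordSmith, and the sketch assumes the word occurs at least once in one of the two lists.

```python
import math

def chi_square_yates(freq_small, size_small, freq_ref, size_ref):
    """Return (statistic, conditions_met) for a 2 x 2 table (hypothetical helper)."""
    observed = [
        [freq_small, freq_ref],
        [size_small - freq_small, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [size_small, size_ref]
    grand_total = size_small + size_ref
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Yates correction: shrink each |O - E| by 0.5, but not below zero.
            diff = max(abs(observed[i][j] - expected[i][j]) - 0.5, 0.0)
            statistic += diff ** 2 / expected[i][j]

    conditions_met = all(e >= 5 for row in expected for e in row)
    return statistic, conditions_met

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

stat, ok = chi_square_yates(30, 1000, 100, 100000)
p = p_value_df1(stat) if ok else None   # no p value when the conditions fail
```

With a word mentioned only twice in a 100-word text and once in a large reference list, the expected value for the small text falls well below 5, so no probability would be reported.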
The criterion for what counts as "outstanding" is based on the minimum probability value selected before the key words were calculated. The smaller the number, the fewer key words in the display. Usually you'll not want more than about 40 key words to handle.
The words appear sorted according to how outstanding their frequencies of occurrence are. Those near the top are outstandingly frequent. At the end of the listing you'll find any which are outstandingly infrequent (negative keywords), in a different colour.
There is no upper limit to the keyness column of a set of key words. It is not necessarily sensible to assume that the word with the highest keyness value must be the most outstanding, since keyness is computed merely statistically; there will be cases where several items are obviously equally key (to the human reader) but the one which is found least often in the reference corpus and most often in the text itself will be at the top of the list.
Source text
As its name suggests, choosing the source text tab gets you to a view of the source text(s).</p> <p><span class="badge badge-info text-white mr-2">269</span>
8.17 WordSmith controller: KeyWords settings
These are found in the main Controller marked KeyWords. This is because some of the choices may affect other Tools. KeyWords and WordList both use similar routines: KeyWords to calculate the key words of a text file, and WordList when comparing word-lists.
WHAT YOU GET
Procedure
Chi-square or Log Likelihood. The default is Log Likelihood. See procedure for further details.
Max. p value
The default level of significance. See p value for more details.</p> <p><span class="badge badge-info text-white mr-2">270</span>
Max. wanted (500), Min. frequency (3), Min.
% of texts (5%)
You may want to restrict the number of key words (KWs) identified so as to find for example the ten most "key" for each text. The program will identify all the key words, sort them by key-ness, and then throw away any excess. It will thus favour positive key words over negative ones.
The minimum frequency is a setting which will help to eliminate any words or clusters which are unusual but infrequent. For example, a proper noun such as the name of a village will usually be extremely infrequent in your reference corpus, and if mentioned only once in the text you're analysing, it is likely not to be "key". The default setting of 3 mentions as a minimum helps reduce spurious hits here. In the case of short texts, less than 600 words long, a minimum of 2 will automatically be used.
The minimum percentage of texts (default = 5%) allows you to ignore words which are not found in many texts. Here the percentage is of the text files in the set you are comparing against a reference corpus. If you're comparing a word-list based on one text, each word in it will occur in 100% of the texts and thus won't get ignored. If you compare a word-list based on 200 texts against your reference corpus, the default of 5% would mean that only words which occur in at least 10 of those texts will be considered for keyness. The KeyWords display shows the number of texts each KW was found in. (If you see ?? that is because the data were computed before that facility came into WordSmith.)
Exclude negative KWs
If this is checked, KeyWords will not compute negative key words (ones which occur significantly infrequently).
Minimal processing
If this is checked, KeyWords will not compute plots, links or KW clusters as it computes the key words (they can always be computed later assuming you do not move or delete the original text files). This is useful if computing a lot of KW files in a batch, eg. to make a database.
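The way the Max. wanted, Min. frequency and Min. % of texts settings interact can be sketched as a simple filter. The record layout (dictionaries with word, keyness, freq and texts fields) and the function name are invented for illustration, keyness is assumed to be signed so that negative key words sort last, and the automatic minimum of 2 for short texts is left out.

```python
def filter_key_words(candidates, total_texts,
                     max_wanted=500, min_frequency=3, min_percent_texts=5.0):
    """Apply the three thresholds to a list of candidate key words.

    Hypothetical helper: each candidate is a dict with 'word', 'keyness',
    'freq' and 'texts' entries.
    """
    min_texts = total_texts * min_percent_texts / 100
    kept = [c for c in candidates
            if c["freq"] >= min_frequency and c["texts"] >= min_texts]
    # Sort by key-ness, highest first, then throw away any excess.
    kept.sort(key=lambda c: c["keyness"], reverse=True)
    return kept[:max_wanted]

candidates = [
    {"word": "spider", "keyness": 310.2, "freq": 40, "texts": 25},
    {"word": "leg", "keyness": 120.5, "freq": 22, "texts": 10},
    {"word": "village", "keyness": 95.0, "freq": 2, "texts": 12},   # too infrequent
    {"word": "web", "keyness": 88.1, "freq": 9, "texts": 9},        # in too few texts
    {"word": "of", "keyness": -60.3, "freq": 300, "texts": 200},    # negative key word
]
kept = filter_key_words(candidates, total_texts=200, max_wanted=2)
```

With 200 texts and the 5% default, a word must occur in at least 10 texts; asking for only two key words then leaves spider and leg, and the negative key word is the first to be dropped.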
Full lemma processing
If this is checked (the default), KeyWords will compute the full frequency in the case of lemmatised items. For example if GO represents GO, WENT, GONE etc. and GO alone had a frequency of 10 but the whole set of GO, WENT, GOES etc. totalled 100, then its frequency will be counted as 100. If unchecked, GO would count only 10.
Max. link frequency
To compute a plot is hard work as all the KWs have to be concordanced so as to work out where they crop up. To compute links between each KW is much harder work again and can take time especially if your KWs include some which occur hundreds or thousands of times in the text. To keep this process more manageable, you can set a default. Here 2000 means that any KW which occurs more than 2000 times in the text will not be used for computing links. (It will still appear in the plots and list of KWs, of course.)
WHAT YOU SEE</p> <p><span class="badge badge-info text-white mr-2">271</span>
Columns
The Columns to show/hide list offers all the standard columns: you may uncheck ones you normally do not wish to see. This will only affect newly computed KeyWords data: earlier data uses the column visibility, size, colours etc already saved. They can be altered using the Layout menu option at any time.
DATABASE
Database: minimum frequency
The default is 1. See database.
Database: associate minimum texts
The default is 5. See associates.
See also: KeyWords Help Contents, KeyWords calculation.</p> <p><span class="badge badge-info text-white mr-2">272</span> WordList Section IX</p> <p><span class="badge badge-info text-white mr-2">273</span>
9 WordList
9.1 purpose
This program generates word lists based on one or more plain text files. The word lists are automatically generated in both alphabetical and frequency order, and optionally you can generate a word index list too.
The point of it...
These can be used
1 simply in order to study the type of vocabulary used;
2 to identify common word clusters;
3 to compare the frequency of a word in different text files or across genres;
4 to compare the frequencies of cognate words or translation equivalents between different languages;
5 to get a concordance of one or more of the words in your list.
Within WordList you can compare two lists, or carry out consistency analysis (simple or detailed) for stylistic comparison purposes.
These word-lists may also be used as input to the KeyWords program, which analyses the words in a given text and compares frequencies with a reference corpus, in order to generate lists of "key-words" and "key-key-words".
Word lists don't have to be of single words, they can be of clusters.
See also: WordList display, Online step-by-step guide showing how
9.2 index
Explanations
What is WordList and what does it do?
Comparing Word-lists
Comparison Display
Consistency Analysis (Simple)
Consistency Analysis (Detailed)
Definitions
Detailed Statistics
Lemmas
</p> <p><span class="badge badge-info text-white mr-2">274</span>
Limitations
Summary Statistics
Match List
Mutual Information
Sort Order
Stop Lists
Type/token Ratios
Procedures
Auto-Join
Batch Processing
Calling up a Concordance
Choosing Texts
Colours
Computing a new variable
Folders
Editing Entries
Editing Filenames
Keyboard Shortcuts
Exiting
Fonts
Minimum & Maximum Settings
Mutual Information Score Computing
Printing
Re-sorting a Word List
Saving Results
Searching for an Entry by Typing
Searching for Entry-types using Menu
Single Words or Clusters
Text Characteristics
Word Index
Zapping entries
See also: WordSmith Main Index, WordList display
9.3 compare word lists
9.3.1 compute key words
With a word list visible in the WordList tool, you may choose Compute | KeyWords to get a keywords analysis of the current word list. This will assume you will wish to use the reference corpus defined in the settings for comparison.
You will see the results in one of the tabs at the bottom of the screen.</p> <p><span class="badge badge-info text-white mr-2">275</span>
As in the KeyWords tool, this procedure compares all the words in your original word list with those in the reference corpus but does not inform you about words which are only found in the reference corpus.
See also: Compare two wordlists, word-list with tags as prefix
9.3.2 comparing wordlists
The idea is to help stylistic comparisons. Suppose you're studying several versions of a story, or different translations of it. If one version uses kill and another has assassinate, you can use this function.
The procedure compares all the words in both lists and will report on all those which appear significantly more often in one than the other, including those which appear more than a minimum number of times in one even if they do not appear at all in the other.
How to do it
1. Open a word list.
2. In the menu, choose File | Compare 2 wordlists.
3. Choose a word list to compare with.
You will see the results in one of the tabs at the bottom of the screen.
The minimum frequency (which you can alter in the Controller KeyWords Settings tab) can be set to 1. If it is raised to say 3, the comparison will ignore words which do not appear at least 3 times in at least one of the two lists.
Choose the significance value (all, or a p value from 0.1 to 0.000001 or what you will). The smaller the p value, the more selective the comparison. In other words, a p setting of 0.1 will show more words than a p setting of 0.0001 will.
The display format is similar to that used in KeyWords. You will also find the Dice coefficient, which compares the vocabularies of the two texts, reported in the Notes.
See also: Compute Key Words, Match List, Consistency Analysis</p> <p><span class="badge badge-info text-white mr-2">276</span>
9.3.3 comparison display
How to get here?
By choosing compare two wordlists.
Here is a comparison window, where we have compared Shakespeare's King Lear with Romeo and Juliet.
The display shows
· frequency in the text you started with, here King Lear, (with % if > 0.01%) -- then, to the right
· frequency in the other text, here Romeo & Juliet, (with % if > 0.01%) -- then, to the right
· chi-square or log likelihood, and p value.
The criterion for what counts as "outstanding" is based on the minimum probability value entered before the lists were compared. The smaller this probability value the fewer words in the display.
The words appear sorted according to how outstanding their frequencies of occurrence are. Those near the top are outstandingly frequent in your main word-list. At the end of the listing you'll find those which are outstandingly infrequent in the first text chosen: in other words, key in the second text.
This comparison is similar to the analysis of "key words" in the KeyWords program. The KeyWords analysis is slightly quicker and allows for batch processing.
The word King is the most key of all; it scores 75 in the keyness column.
At the bottom we see the words of King Lear which are least key in comparison with the play Romeo and Juliet.</p> <p><span class="badge badge-info text-white mr-2">277</span>
9.4 merging wordlists
The point of it
You might want to merge 2 word lists (or concordances, mutual information lists etc.) with each other if making each one takes ages or if you are gradually building up a master word list or concordance based on a number of separate genres or text-types.
How to do it
With one word-list (or concordance) opened, choose File | Merge with and select another.
Be aware that...
Making a merged word list implies that each set of source texts was different. If you choose to merge 2 word lists both of which contained information about the same text file, WordSmith will do as you ask even though the information about the number of occurrences and of texts in which each word-type was found is (presumably) inaccurate.
Merging a list in English with another in Spanish: if you start with the one in Spanish, the one in English will be merged in and henceforth treated as if it were Spanish, eg. in sort order. Presumably if you try to merge one in English with one in Arabic (I've never tried) you should see all the forms but you would get different results merging the Arabic one into the English one (all the Arabic words would be treated as if they were English).
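The caveat about merging lists that share a source text is easy to see in a sketch. Here a word list is modelled as a dictionary mapping each word to a pair (frequency, number of texts); both the data layout and the function name are assumptions made for illustration, not WordSmith's file format.

```python
def merge_word_lists(first, second):
    """Merge two word lists by adding frequencies and text counts.

    Hypothetical helper: each list maps word -> (frequency, texts).
    """
    merged = dict(first)
    for word, (freq, texts) in second.items():
        old_freq, old_texts = merged.get(word, (0, 0))
        merged[word] = (old_freq + freq, old_texts + texts)
    return merged

version_a = {"wolf": (3, 1), "red": (5, 1)}
version_b = {"wolf": (2, 1), "hood": (4, 1)}
merged = merge_word_lists(version_a, version_b)

# Merging a list with itself simply doubles every count, which is why the
# result is only meaningful when the two sets of source texts differ.
doubled = merge_word_lists(version_a, version_a)
```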
9.5 consistency
9.5.1 consistency analysis (range)
This function (termed "range" by Paul Nation) comes automatically with any word-list. In any word-list you will see a column headed "Texts". This shows the number of texts each word occurred in (the maximum here being the total number of text-files used for the word-list).</p> <p><span class="badge badge-info text-white mr-2">278</span>
The point of it...
The idea is to find out which words recur consistently in lots of texts of a given genre. For example, the word consolidate was found to occur in many of a set of business Annual Reports. It did not occur very often in each of them, but did occur much more consistently in the business reports than in a mixed set of texts.
Naturally, words like the are consistent across nearly all texts in English. (While working on a set of word lists to compare with business reports, I found one text without the. I also discovered that one of my texts was in Italian: but this wasn't the one without the! The culprit was an election results list, which contained lots of instances of Cons., Lab. and place names, but no instances of the.)
To analyse common grammar words like the, a consistency list may be very useful. Even so, you're likely to find some common lexical items recur surprisingly consistently.
To eliminate the commonly consistent words and find only those which seem to characterise your genre or sub-genre, you need to find out which are significantly consistent. Save your word list, then use it for comparison with others in WordList, or using KeyWords. This way you can determine which are the significantly consistent words in your genre or sub-genre.
See also: Consistency Analysis (Detailed), Comparing Word-lists, Match List
9.5.2 detailed consistency analysis
This function does exactly the same thing as simple consistency, but provides much more detail.
The point of it...
The idea is to help stylistic comparisons. Suppose you're studying several versions of a story, or different translations of it. This function enables you to see all the words which are used in the word lists which you have called up.
The Total column shows how many instances of each word occurred overall, Texts shows how</p> <p><span class="badge badge-info text-white mr-2">279</span> many text-files it came in. Then there are two columns (No. of Lemmas, and Set which behaves as in a word-list) and then a column for each text. In this case, the word after occurred in all 37 texts, it occurred 393 times in all, and it was most frequent in all's well that ends well at 18 occurrences. Statistics and filenames can be seen for the set of 37 Shakespeare plays used here by clicking on the tabs at the bottom. Notes can be edited and saved along with the detailed consistency list.
There is no limit except the limit of available memory as to how many text files you can process in this procedure. You can set a minimum number of texts and a minimum overall frequency in the WordList settings in the Controller.
How to do it...
In the window you see when you press New... ( ) you will be offered a tab showing detailed consistency. To choose more than 1, use Control or Shift as you click. Below I have chosen five out of 6 available. (These are versions of Red Riding Hood.)</p> <p><span class="badge badge-info text-white mr-2">280</span>
Initially they may come in the wrong order:</p> <p><span class="badge badge-info text-white mr-2">281</span>
so adjust with the two buttons at the right, and now press compute Detailed Consistency now.
Settings
You can require a minimum number of texts and minimum frequency in the main Controller if you click this.
Sorting
Each column can be sorted by clicking on its header column (Word, Freq. etc.). When working on Shakespeare plays, to get the words which occurred in all 37 to the top, I clicked Texts.</p> <p><span class="badge badge-info text-white mr-2">282</span>
Row percentages
If you choose to Show as %, you will transform the view so as to get row percentages. In this screenshot,</p> <p><span class="badge badge-info text-white mr-2">283</span>
we see the last few items which appear only in Anthony and Cleopatra, then Cleopatra (93.3%), Egypt (93.18%) etc. (Egypt appears also in A Midsummer Night's Dream, As You Like It, King Henry VIII.)
See also: Detailed Consistency Relations, Comparison Display, Consistency Analysis (range), Match List, Column Totals, Comparing Word-lists
re-sorting: consistency lists
The frequency-ordered consistency display can be re-sorted by
· alphabetical order (Word)
· total frequencies overall (Total, the default)
· frequencies in any given file (you see the file names).
Click on Word, Total or a filename to choose. The sort can be either ascending or descending, the default being descending.
See also: Sorting word-lists
9.5.3 detailed consistency relations
With a detailed consistency list such as this, of five versions of the fairy story Little Red Riding Hood, it looks as if the most long-winded story is probably version 5 (red5.lst). If you click the detailed cons. relation tab you can see the relevant statistics more usefully:</p> <p><span class="badge badge-info text-white mr-2">284</span>
where it can be seen that red5 has a type-count of 462 words, more than any other, and that the relation between red2 and red3 is the closest with a relation statistic of 0.487.
This relation is the Dice coefficient, based on the joint frequency and the type-counts of the two texts.
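For readers who want to check such a figure by hand, here is a minimal sketch of the Dice coefficient in its usual form: twice the number of word types the two lists share, divided by the sum of their type-counts. The function name and the tiny vocabularies are invented for illustration, and it is an assumption that WordSmith applies exactly this standard formula.

```python
def dice_coefficient(types_a, types_b):
    """Dice coefficient of two vocabularies: 2 * shared / (size_a + size_b).

    Hypothetical helper; each argument is any iterable of word types.
    """
    a, b = set(types_a), set(types_b)
    if not a and not b:
        return 0.0                      # two empty lists: nothing to compare
    shared = len(a & b)                 # each matching type counts once
    return 2 * shared / (len(a) + len(b))

version_x = ["red", "riding", "hood", "wolf"]
version_y = ["red", "hood", "granny"]
relation = dice_coefficient(version_x, version_y)   # 2 * 2 / (4 + 3)
```

Note that frequencies play no part here: a word type counts as one match however often it occurs in either list.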
Type count is the number of different word types in each text.
Joint frequency: there are 138 matches in the vocabulary of these two versions, which means that 138 distinct word types matched up in the two word lists. (If for example book appeared 20 times in one list and 3 times in the other, that would count as 1 match.)
A Dice coefficient ranges between 0 and 1. The 0.487 can be thought of like a percentage, i.e. there's about a 49% overlap between the vocabularies of the two versions of the same story.
See also: Detailed Consistency.
9.6 find filenames
If you have an index-based word list on screen you can see how many text files each word was found in. For example, in this index based on Shakespeare plays, EYES AND EARS occurs in 7 of the 37 plays.
What if you want to know which of those plays?</p> <p><span class="badge badge-info text-white mr-2">285</span>
Select the word(s) or cluster(s) you're interested in and choose File | Find Files in the menu and you will get something like this:
See also: source texts, making a WordList index, selecting multiple entries
9.7 Lemmas (joining words)
9.7.1 what are lemmas and how do we join words?
In a word list, a key word list or a list of collocates you may want to store several entries together: e.g. want; wants; wanting; wanted. Bringing them together means you're treating them as members of the same "lemma" or set -- rather like a headword in a dictionary.
A lemmatised head entry has a red mark in the left margin beside it. The others you marked will be coloured as if deleted. The linked entries which have been joined to the head can be seen at the right.
Here we see a word list based on 3-word clusters where a good deal originally had a frequency of 24, but has been joined to a great deal and a good few and thereby risen to 141.
Joining can be done automatically or manually.
</p> <p><span class="badge badge-info text-white mr-2">286</span>
View all the various lemma forms
Double-click on the Lemmas column as in the shot below, and a window of Lemma Forms will open up, showing the various components.
Get rid of the deleted words
If you don't want to see the deleted words, choose Ctrl-Z to zap them.
See also: Auto-Joining methods, Using a text file to lemmatise, selecting multiple entries, Concord lemmatisation
9.7.2 manual joining
Manual joining
You can simply do this by dragging one entry to another. Suppose your word list has
WANT
WANTED
WANTING</p> <p><span class="badge badge-info text-white mr-2">287</span>
you can simply grab wanting or wanted with your mouse and place it on want.
(See choosing lemma file if you want to join these to a word which isn't in the list.)
Can't see the word to join to?
If you cannot see all the items you want to join in one screen, you can do the same thing by marking.
1. Use Alt+F5 to mark an entry for joining to another. The first one you mark will be the "head". For the moment, while you're still deciding which other entries belong with it, the edge of that row will be marked green. Any entries which you then decide to link with the head (by again pressing Alt+F5) will show they're marked too, in white. (If you change your mind you can press Shift+Alt+F5 and the marking will disappear.)
2. Use F4 to join all the entries which you've marked. The program will then put the joint frequencies of all the words you've marked with the frequency of the one you first marked (the head).
Alternatively,
1. select the head word; this makes it visible in the status bar.
2. Find the word you want to join and drag it to the status bar.
</p> <p><span class="badge badge-info text-white mr-2">288</span>
To Un-join
If you select an item which has lemmas visible at the right and press Ctrl+F4, this will unjoin the entries of that one lemma. To unjoin all lemmatised forms in the entire list, in the menu choose Edit | Join | Unjoin All.
9.7.3 auto-joining lemmas
There are two methods, a) based on a list, and b) based on a template.
a) File-based joining
You can join up lemmas using a text file which automates the matching & joining process. The actual processing of the list takes place when you choose the menu option Match Lemmas ( ) in WordList, Concord or KeyWords. Every entry in your lemma list will be checked to see whether it matches one of the entries in your word list. In the example, if, say, am, was, and were are found, they will be stored as lemmas of be. If go and went are found, then went will be joined to go.
b) Auto-joining based on a template
Or you can auto-join any of the entries in your current word list which meet your criteria: the menu option Auto-Join can be used to specify a string such as S or S;ED;ING and will then go through the whole word list, lemmatising all entries where one word only differs from the next by having S or ED or ING on the end of it. (Use ; to separate multiple suffixes.)
Prefix / Suffix / Infix
By default all strings typed in are assumed to be suffixes; to join prefixes put an asterisk (*) at the right end of the prefix. If you want to search for infixes (eg. bloody in absobloodylutely [languages like Swahili use infixes a lot]) put an asterisk at each end.
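A template of this kind can be imitated with a short sketch. The code below reads a template string, treats plain items (or items with a leading asterisk) as suffixes, items with a trailing asterisk as prefixes and items with an asterisk at each end as infixes, and joins a word to a head whenever stripping the affix yields another word in the list. All names are hypothetical, the word list is assumed to be upper-case, and matching against any entry rather than only the neighbouring one is a simplification.

```python
def parse_template(template):
    """Split e.g. 'UN*;ED;ING' into (kind, affix) rules (hypothetical helper)."""
    rules = []
    for part in template.split(";"):
        part = part.strip().upper()
        if not part:
            continue
        if part.startswith("*") and part.endswith("*") and len(part) > 2:
            rules.append(("infix", part[1:-1]))
        elif part.endswith("*"):
            rules.append(("prefix", part[:-1]))
        else:
            rules.append(("suffix", part.lstrip("*")))
    return rules

def strip_affix(word, kind, affix):
    """Return the word without the affix, or None if the rule does not apply."""
    if kind == "suffix" and word.endswith(affix) and len(word) > len(affix):
        return word[:-len(affix)]
    if kind == "prefix" and word.startswith(affix) and len(word) > len(affix):
        return word[len(affix):]
    if kind == "infix":
        at = word.find(affix, 1)
        if 0 < at < len(word) - len(affix):
            return word[:at] + word[at + len(affix):]
    return None

def auto_join(frequencies, template):
    """Map each head word to the entries that would be joined to it."""
    rules = parse_template(template)
    joined = {}
    for word in frequencies:
        for kind, affix in rules:
            head = strip_affix(word, kind, affix)
            if head is not None and head != word and head in frequencies:
                joined.setdefault(head, []).append(word)
                break
    return joined

word_list = {"BOOK": 10, "BOOKS": 4, "BOOKED": 2, "BOOKING": 3, "DO": 7,
             "UNDO": 1, "ABSOLUTELY": 2, "ABSOBLOODYLUTELY": 1, "BED": 1}
suffix_joins = auto_join(word_list, "S;ED;ING")
```

BED is left alone here because B is not itself an entry, which shows why a blind template run can still need checking by eye for cases such as BUS and BU that happen to match.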
Examples

S;ED;ING will join books to book, booked to book and booking to book
*S;*ED;*ING will join books to book, booked to book and booking to book
UN*;ED;ING will join undo to do, booked to book and booking to book
*BLOODY* will join absobloodylutely to absolutely

The process can be left to run quickly and automatically, or you can have it confirm with you before joining each one. Automatic lemmatisation, like search-and-replace spell-checking, can produce oddities if just left to run!

To stop in the middle of auto-joining, press Escape.

Tip

With a previously saved list, try auto-joining without confirming the changes (or choose Yes to All during it). Then choose the Alphabetical (as opposed to Frequency) version of the list and sort on Lemmas (by pressing the Lemmas column heading). You will see all the joined entries at the top of the list. It may be easier to Unjoin (Ctrl+F4) any mistakes than to confirm each one... Finally, sort on the Word and save.
</p> <p><span class="badge badge-info text-white mr-2">289</span>
See also: Lemmatisation

9.7.4 choosing lemma file

The point of it...

You may choose to lemmatise all items in the current word-list using a standard text file which groups words which belong together (be -> was, is, were, etc.). While it is time-consuming producing the text file the first time, it will be very useful if you want to lemmatise lots of word lists, and is much less "hit-and-miss" than auto-joining using a template.

There is an English-language lemma list from Yasumasa Someya at http://lexically.net/downloads/BNC_wordlists/e_lemma.txt .

How to do it

Lemma list settings are accessed via the Lists option in the WordList menu or an Advanced Settings button in the Controller
</p> <p><span class="badge badge-info text-white mr-2">290</span>
Choose the appropriate button (for Concord, KeyWords or WordList) and type the file name or browse for it, then Load it.

The file should contain a plain text list of lemmas with items like this:

BE -> AM, ARE, WAS, WERE, IS
GO -> GOES, GOING, GONE, WENT

WordSmith then reads the file and displays them (or a sample if the list is long). The format allows any alphabetic or numerical characters in the language the list is for, plus the single apostrophe, space, underscore. In other words, if you mistakenly put GO = GOES that line won't be included because of the = symbol.

The actual processing of the list will take place when you compute your word list, key word list or concordance or when you choose the menu option Match Lemmas in WordList, Concord or KeyWords. See Match List for a more detailed explanation, with screenshots. Lemmatising occurs before any stop list is processed.

What if my text files don't contain the headword of the lemma?
</p> <p><span class="badge badge-info text-white mr-2">291</span>
Suppose you are matching AM, ARE etc with BE as in the list above, but your texts don't actually contain the word BE. In that case the tool will insert BE with zero frequency and add AM, ARE etc as needed.

See also: Lemmatisation, Match List, Lemmatisation in Concord, Stop List

9.8 WordList Index

9.8.1 what is an Index for?

the point of it

1. One of the uses for an Index is to record the positions of all the words in your text file, so that you can subsequently see which word came in which part of each text. Another is to speed up access to these words, for example in concordancing. If you select one or more words in the index and compute a concordance, you get a speedy concordance.
2. Another is to compute "Mutual Information" scores which relate word types to each other.
3.
Or you can use an index to see word clusters.
4. Finally, an index is needed to generate concgram searches.

See also Making an Index List, Viewing Index Lists, Exporting index data, find filenames for word clusters, WSConcgram, WordList Help Contents

9.8.2 making a WordList Index

The process is just like the one for making a word-list except that after choosing your texts and ensuring you like the index filename, you choose the bottom button here:
</p> <p><span class="badge badge-info text-white mr-2">292</span>
In this screenshot above, the basic filename is shakespeare_plays: WordSmith will add .tokens and .types to this basic filename as it works.

Two files are created for each index:

.tokens file: a large file containing information about the position of every word token in your text files.
.types file: knows the individual word types.

If you choose an existing basic filename which you have already used, WordList will check whether you want to add to it or start it afresh:

An index permits the computation of word clusters and Mutual Information scores for each
</p> <p><span class="badge badge-info text-white mr-2">293</span>
word type.

The screenshot below shows the progress bars for an index of the BNC corpus; on a modern PC it might work at a rate of about 2.8 million words per minute. The resulting BNC.tokens file was 1.6 GB in size and the BNC.types file was 26 MB.

adding to an index

To add to an existing index, just choose some more texts and choose File | New | Index. If the existing file-name is already in use for an index, you will be asked whether to add more or start it afresh as shown above.

See also Using Index Lists, Viewing Index Lists, WordList Help Contents.

9.8.3 index clusters

WordList clusters

A word list doesn't need to be of single words.
You can ask for a word list consisting of two, three, up to eight words on each line. To do cluster processing in WordList, first make an index.

How to see clusters...

Open the index. Now choose Compute | Clusters.
</p> <p><span class="badge badge-info text-white mr-2">294</span>
Words to make clusters from

· "all": all the clusters involving all words above a certain frequency (this will be s-l-o-w for a big corpus like the BNC), or
· "selection": clusters only for words you've selected (e.g. you have highlighted BOOK and BOOKS and you want clusters like book a table, in my book).

To choose words which aren't next to each other, press Control and click in the number at the left -- keep Control held down and click elsewhere. The first one clicked will go green and the others white. In the picture below, using an index of the BNC corpus, I selected world and then life by clicking numbers 164 and 167.
</p> <p><span class="badge badge-info text-white mr-2">295</span>
The process will take time. In the case of BNC, the index knows the positions of all of the 100 million words. To find 3-word clusters, in the case above, it took about a minute to process all the 115,000 cases of world and life and find 5,719 clusters like the world bank and of real life. Chris Tribble tells me it took his PC 36 hours to compute all 3-word clusters on the whole BNC ... he was able to use the PC in the meantime but that's not a job you're going to want to do often.

What you see

The cluster size must be between 2 and 8 words.
The min. frequency is the minimum number of each that you want to see.
omit #: if selected, this won't show any clusters involving numbers and dates.
omit phrase frames: see phrase frames section below.

Here the user has chosen to see any 3-4-word clusters that appear 5 or more times.

Working constraints

The "max. frequency %" setting is to speed the process up.
In more detail: it means the maximum frequency percentage which the calculation of clusters for a given word will process. This is because there are lots and lots of the very high frequency items and you may well not be interested in clusters which begin with them. For example, the item the is likely to be about 6% of any word-list (about 6 million of them in the BNC therefore), and you might not want clusters starting the... -- if so, you might set the max. percent to 0.5% or 0.1% (which for the BNC corpus will cut out the top 102 frequency words). You
</p> <p><span class="badge badge-info text-white mr-2">296</span>
will still get clusters which include very high frequency items in the middle or end, like the a in book a table, but would not get in my book, which begins with the very high frequency word in. The more words you include, the longer the process will take...

Stop at, like Concord clusters, offers a number of constraints, such as sentence and other punctuation-marked breaks. The idea is that a 5-word cluster which starts in one sentence and continues in the next is not likely to make much sense.

Max. seconds per word is another way of controlling how long the process will take. The default (0) means no limit. But if you set this e.g. to 30 then as WordList processes the words in order, as soon as one has taken 30 seconds no further clusters will be collected starting with that word.

batch processing allows you to create a whole set of cluster word-lists at one time.

Phrase frames

These are what William H. Fletcher has defined as phrase-frames, i.e. "groups of wordgrams identical but for a single word", in his kfNgram program.

Here, processing 23 Dickens novels shows lots of phrase frames where the wildcard word is represented with *. If you double-click the lemmas column (highlighted here in yellow), you get to see the detail.
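To make the idea of clusters and phrase frames concrete, here is a rough sketch. It is not how WordList works internally: the function names and the sample sentence are invented, and it ignores the stop-at and max. frequency settings described above.

```python
from collections import Counter, defaultdict


def clusters(tokens, size=3, min_freq=1):
    """Count every run of `size` consecutive words, keeping those seen min_freq+ times."""
    counts = Counter(tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1))
    return {cluster: freq for cluster, freq in counts.items() if freq >= min_freq}


def phrase_frames(cluster_counts):
    """Group clusters which are identical but for a single word, shown as *."""
    frames = defaultdict(dict)
    for cluster, freq in cluster_counts.items():
        for pos in range(len(cluster)):
            frame = cluster[:pos] + ("*",) + cluster[pos + 1:]
            frames[frame][cluster] = freq
    # A frame is only interesting if at least two different clusters fit it.
    return {frame: variants for frame, variants in frames.items() if len(variants) > 1}


# An invented scrap of text, echoing the "his hand" examples.
tokens = "gave his hand to her and drew his hand to him".split()
three_word = clusters(tokens, size=3)
frames = phrase_frames(three_word)
for frame, variants in frames.items():
    print(" ".join(frame), "<-", [" ".join(v) for v in variants])
```

Each frame plays the part of the head entry, with the clusters that fit it listed as its variants, much as in the Lemmas column.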
The process joins all the variants of the phrase in the Lemmas column. In the word list itself they will appear deleted (because they have been joined to another item, the phrase frame). You can un-join
</p> <p><span class="badge badge-info text-white mr-2">297</span>
them all if you want (Edit | Joining | Unjoin or Unjoin all).

Omit phrase frames?

If you don't want to see phrase frames, select the omit phrase frames option.

Here below, the listing has all his hand sequences together but not drawing his hand across, gave his hand to, etc. as shown in the phrase frame view above.
</p> <p><span class="badge badge-info text-white mr-2">298</span>
Here is a small set of 3-word clusters involving rabies from the BNC corpus. Some of them are plausible multi-word units.

It's a word list

Finally, remember this listing is just like a single-word word list. You can save it as a .lst file and open it again at any time, separately from the index.

See also: find the files for specific clusters, clusters in Concord

9.8.4 join clusters

The idea is to group clusters like

I DON'T THINK
NO I DON'T THINK
I DON'T THINK SO
I DON'T THINK THAT

etc. You can join them up in a process like lemmatisation, either so that the smaller clusters get merged as 'lemmas' of a bigger one, or so that the smaller ones end up as 'lemmas'.

In this screenshot, shorter clusters have been merged with longer ones so that A BEARING OF FORTY-FIVE DEGREES relates to several related clusters:
</p> <p><span class="badge badge-info text-white mr-2">299</span>
visible by double-clicking the lemmas to show something like this:

How to do it

Choose Edit | Join | Join Clusters in the WordList menu. The process takes quite a time because each cluster has to be compared with all those in the rest of the list; interrupt it if necessary by pressing Suspend.
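Why the comparison is slow is easy to see from a sketch. The version below merges shorter clusters into a longer one that contains them, as in the screenshot case; the function names and the sub-clusters of A BEARING OF FORTY-FIVE DEGREES are made up for the example, and WordSmith's real matching rules may well differ.

```python
def contains(longer, shorter):
    """True if the words of `shorter` occur as one unbroken run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def join_clusters(cluster_list):
    """Map each shorter cluster to the longest cluster in the list that contains it."""
    words = {c: tuple(c.split()) for c in cluster_list}
    longest_first = sorted(cluster_list, key=lambda c: len(words[c]), reverse=True)
    joined = {}
    for cluster in cluster_list:
        # Every cluster is compared against every other one: hence the long wait.
        for head in longest_first:
            if len(words[head]) > len(words[cluster]) and contains(words[head], words[cluster]):
                joined[cluster] = head
                break
    return joined


# Hypothetical cluster list for illustration.
cluster_list = [
    "A BEARING OF FORTY-FIVE DEGREES",
    "A BEARING OF",
    "BEARING OF FORTY-FIVE",
    "OF FORTY-FIVE DEGREES",
    "I DON'T THINK",
]
joined = join_clusters(cluster_list)
for short, head in joined.items():
    print(short, "->", head)
```

With a list of n clusters there are roughly n x n comparisons to make, which is why a big cluster list takes so long.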
9.8.5 index lists: viewing

In WordList, open an index as you would any other kind of word-list file -- using File | Open. The filename will end .tokens. Easier, in the Controller | Previous lists, choose any index you've made and double-click it.

The index looks exactly like a large word-list. (Underneath, it "knows" a lot more and can do more but it looks the same.)
</p> <p><span class="badge badge-info text-white mr-2">300</span>
The picture above shows the top 10 words in the BNC Corpus. Number 5 (#) represents numbers or words which contain numbers such as £50.00. These very frequent words are also very consistent -- they appear in at least 99% of the 4,054 texts of BNC.

In the view below, you see words sorted by the number of Texts: all these words appeared 10 times in the corpus but their frequencies vary.

You can highlight one or more words, or mark them, to get a speedy concordance.

But its best use to start with is to generate word clusters like these:
</p> <p><span class="badge badge-info text-white mr-2">301</span>
See also Making an Index List, WordList clusters, WordList Help Contents.

9.8.6 index exporting

The point of it...

An index file knows the position of every single word in your corpus and it is possible therefore to ask it to supply specific data. For example, the lengths of each sentence or each text in the corpus (in words), or the position of each occurrence of a given word.

How to do it

With an index open, choose File | Export index data,
</p> <p><span class="badge badge-info text-white mr-2">302</span>
then complete the form with what you need. Here we have chosen to export the details about the word SHOESTRING in a given index, and to get to see all the sentence lengths (of all sentences in the corpus, not just the ones containing that word).
A fragment of the results is shown here:
</p> <p><span class="badge badge-info text-white mr-2">303</span>
At the top there are word-lengths of some of the 480 text files, the last of which was 6551 words long; then we see the details of 5 cases of the word SHOESTRING in the corpus, which appeared twice in text AJ0.txt, once in J3W.txt etc.; finally we get the word-lengths of all the sentences in the corpus: the first one only 4 words long.

This process will be quite slow if you request a lot of data. If you don't check the sentence lengths you will still get text lengths; it will be quicker if you leave the word details space empty.

9.9 menu search

Using the menu you can search for a sub-string within an entry -- e.g. all words containing "fore" (by entering *fore* -- the asterisk means that the item can be found in the middle of a word, so *fore will find before but not beforehand, while *fore* will find them both). These searches can be repeated.

This function enables you to find parts of words so that you can edit your word-list, e.g. by joining two words as one.

You can search for ends or middles of words by using the * wildcard. Thus
*TH* will find other, something, etc.
*TH will find booth, sooth, etc.

You can then use F8 to repeat your last search.

The search hot keys are:
F8 repeat last search (use in conjunction with F10 or F11)
F10 search forwards from the current line
F11 search backwards from the current line
F12 search starting from the beginning

This function is handy for lemmatization (joining words which belong under one entry, such as seem/ seems/ seemed/ seeming etc.)
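The asterisk patterns above behave much like the shell-style wildcards in Python's standard `fnmatch` module, so a quick way to try them out is the sketch below. It is only an analogy: `search` is an invented helper, and `fnmatch` also gives special meaning to ? and [ ], which may not match WordSmith's own rules.

```python
from fnmatch import fnmatchcase


def search(words, pattern):
    """Return the words matching a pattern in which * stands for any run of letters."""
    pattern = pattern.upper()
    return [word for word in words if fnmatchcase(word.upper(), pattern)]


words = ["BEFORE", "BEFOREHAND", "OTHER", "SOMETHING", "BOOTH", "SOOTH"]
print(search(words, "*FORE"))    # ends in FORE
print(search(words, "*FORE*"))   # FORE anywhere
print(search(words, "*TH"))      # ends in TH
print(search(words, "*TH*"))     # TH anywhere, including at the end
```

Note that a trailing asterisk can match nothing at all, so *TH* finds booth and sooth as well as other and something.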
</p> <p><span class="badge badge-info text-white mr-2">304</span>
See also: searching for an entry by typing

9.10 relationships between words

9.10.1 mutual information and other relations

the point of it

A Mutual Information (MI) score relates one word to another. For example, if problem is often found with solve, they may have a high mutual information score. Usually, the will be found much more often near problem than solve, so the procedure for calculating Mutual Information takes into account not just the most frequent words found near the word in question, but also whether each word is often found elsewhere, well away from the word in question. Since the is found very often indeed far away from problem, it will not tend to be related, that is, it will get a low MI score.

There are several other alternative statistics: you can see examples of how they differ here.

This relationship is bi-lateral: in the case of kith and kin, it doesn't distinguish between the virtual certainty of finding kin near kith, and the much lower likelihood of finding kith near kin.

There are various different formulae for computing the strength of collocational relationships. The MI in WordSmith ("specific mutual information") is computed using a formula derived from Gaussier, Lange and Meunier described in Oakes, p. 174; here the probability is based on total corpus size in tokens. Other measures of collocational relation are computed too, which you will see explained under Mutual Information Display.

Settings

The Relationships settings are found in the Controller under Main Settings | Advanced | Index or in a menu option in WordList.

See also: Mutual Information Display, Computing Mutual Information, Making an Index List, Viewing Index Lists, WordList Help Contents.

See Oakes for further information about Mutual Information, Dice, MI3 etc.
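As a rough guide to what these statistics measure, here are the usual textbook forms of MI, MI3 and Dice, with probabilities based on the total corpus size in tokens. Treat this as an assumption-laden sketch: the function names and figures are invented, and WordSmith's exact formulae (see Oakes and the Formulae section) may differ in detail, for example in how the span is handled.

```python
import math


def mutual_information(joint, freq1, freq2, total_tokens):
    """log2 of how much more often the pair co-occurs than chance would predict."""
    return math.log2((joint * total_tokens) / (freq1 * freq2))


def mi3(joint, freq1, freq2, total_tokens):
    """Like MI, but the joint frequency is cubed, which favours frequent pairs."""
    return math.log2((joint ** 3 * total_tokens) / (freq1 * freq2))


def dice(joint, freq1, freq2):
    """Share of the two words' occurrences which are occurrences together (0 to 1)."""
    return 2 * joint / (freq1 + freq2)


# Invented figures: a 1-million-token corpus, word 1 occurring 100 times,
# word 2 occurring 200 times, and the two found together 50 times.
print(round(mutual_information(50, 100, 200, 1_000_000), 2))
print(round(mi3(50, 100, 200, 1_000_000), 2))
print(round(dice(50, 100, 200), 2))
```

None of the three takes word order into account, which is the bi-lateral property mentioned above: swapping word 1 and word 2 gives the same score.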
9.10.2 relationships display

The Relationships procedure contains a number of columns and uses various formulae:
</p> <p><span class="badge badge-info text-white mr-2">305</span>
Word 1: the first word in a pair, followed by Freq. (its frequency in the whole index).
Word 2: the other word in that pair, followed by Freq. (its frequency in the whole index). If you have computed "to right only", then Word 1 precedes Word 2.
Texts: the number of texts this pair was found in (there were 23 in the whole index).
Gap: the most typical distance between Word 1 and Word 2.
Joint: their joint frequency over the entire span (not just the joint frequency at the typical gap distance).

In line 7 of this display, BACKWARDS occurs 83 times in the whole index (based on Dickens novels), and FORWARDS 8 times. They occur together 62 times. The gap is 2 because backwards, in these data, typically comes 2 words away from forwards. The pair backwards * forwards comes in 17 texts. (This search was computed using the to right only setting mentioned above.)

As usual, the data can be sorted by clicking on the headers. Let's now sort by clicking on "Z score" first and "Word 1" second.

You get a double sort, main and secondary, because sometimes you will want to see how MI or Z score or other sorting affects the whole list and sometimes you will want to keep the words sorted alphabetically and only sort by MI or Z score within each word-type. Press Swap to switch the primary & secondary sorts.
</p> <p><span class="badge badge-info text-white mr-2">306</span>
The order is not quite the same ... but not very different either. Both Freq. columns have fairly small numbers.

Here is the display sorted by MI3 Score (Oakes p. 172):
</p> <p><span class="badge badge-info text-white mr-2">307</span>
Much more frequent items have jumped to the top.

Now, by Log Likelihood (Dunning, 1993):

Here the Word 2 items are again very high frequency ones and we get at colligation (grammatical collocation).

A T Score listing is fairly similar:
</p> <p><span class="badge badge-info text-white mr-2">308</span>
but a Dice score ordered list brings us back to results akin to the first two shown above:

See also: Formulae, Mutual Information and other relationships, Computing Relationships, Making an Index List, Viewing Index Lists, WordList Help Contents.

See Oakes for further information about the various statistics offered.
</p> <p><span class="badge badge-info text-white mr-2">309</span>
9.10.3 relationships computing

To compute these relationship statistics you need a WordList Index. Then in its menu, choose Compute | Relationships.

words to process

You can choose whether to compute the statistics for all entries, or only any selected (highlighted) entries, or only those between two initial characters e.g. between A and D, or indeed to use your own specified words only.

If you wish to select only a few items for MI calculation, you can mark them first. Or you can always do part of the list (e.g. A to D) and later merge your mutual-information list with another (E to H).

Alternatively you may choose to use only items from a plain text file constructed using the same syntax as a match-list file, or to use all items except ones from your plain text file.

omissions

omit any containing # will cut out numbers, and omit if word1=word2 is there because you might
</p> <p><span class="badge badge-info text-white mr-2">310</span>
find that GOOD is related to GOOD if there are lots of cases where these 2 are found near each other.

show pairs both ways allows you to locate all the pairs more easily because it doubles up the list. For example, suppose we have a pair of words such as HEAVEN and EARTH. This will normally enter the list only in one order, let us say HEAVEN as word 1 and EARTH as word 2. If you're looking at all the words in the Word 1 column, you will not find EARTH. If you want to be able to see the pair as both HEAVEN - EARTH and EARTH - HEAVEN, select show pairs both ways. Here we can see this with WITH and DUST.

to right only: if this is checked, possible relations are computed to the right of the node only. That is, when considering DUST, say, cases of WITH to the right will be noticed but cases where WITH is to the left of DUST would get ignored. Here, the number of texts goes down to 5 from 9, MI score is lower, etc, because the process looks only to the right. (In the case of a right-to-left language like Arabic, the processing is still of the words following the node word.)

recompute token count allows you to get the number of tokens counted again e.g. after items have been edited or deleted.
</p> <p><span class="badge badge-info text-white mr-2">311</span>
min. and max

max. frequency percent: ignores any tokens which are more frequent than the percentage indicated. Set the maximum frequency, for example, to 0.5% to cut out words whose frequency is greater than that. (The point of this is to avoid computing mutual information for words like the and of, which are likely to have a frequency greater than say 1.0%. For example 0.5%, in the case of the BNC, would mean ignoring about 20 of the top frequency words, such as WITH, HE, YOU. 0.1% would cut about 100 words including GET, BACK,
BECAUSE. If you want to include all words, then set this to 100.000)

min. frequency: the minimum frequency for any item to be considered for the calculation. (Default = 5; a minimum frequency of 5 means that no word of frequency 4 or less in the index will be visible in the relationship results. If an item occurs only once or twice, the relationship is unlikely to be informative.)

stop at allows you to ignore potential relationships e.g. across sentence boundaries. It has to do with whether breaks such as punctuation or sentence breaks determine that one word cannot be related to another. With stop at sentence break, "I wrote the letter.
</p> <p><span class="badge badge-info text-white mr-2">312</span>
Then I posted it" would not consider posted as a possible collocate of letter because there's a sentence break between them.

span: the number of intervening words between collocate and node. With a span of 5, the node wrote would consider the, letter, then, I and posted as possible collocates if stop at were set at no limits in the example above.

min. texts: the minimum number of texts any item must be found in to be considered for the calculation.

min. Dice/mutual info./MI3 etc: the minimum number which the MI or other selected statistic must come up with to be reported. A useful limit for MI is 3.0. Below this, the linkage between node and collocate is likely to be rather tenuous.

Choose whether ALL the values set here are used when deciding whether to show a possible relationship or ANY. (Each threshold can be set between -9999.0 and 9999.0.)

Computing the MI score for each and every entry in an index takes a long time: some years ago it took over an hour to compute MI for all words beginning with B in the case of the BNC edition (written, 90 million words) in the screenshot below, using default settings. It might take 24 hours to process the whole BNC, 100 million words, even on a modern powerful PC.
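The span and stop at settings described above can be tried out on the letter example with a small sketch. The function `collocates` and its very crude sentence splitting are invented for illustration and are far simpler than what the index actually does.

```python
import re


def collocates(text, node, span=5, stop_at_sentence=True):
    """List the words found within `span` words either side of each case of the node."""
    # Crude sentence splitting: break after ., ! or ? when followed by a space.
    pieces = re.split(r"(?<=[.!?])\s+", text) if stop_at_sentence else [text]
    node = node.lower()
    found = []
    for piece in pieces:
        words = re.findall(r"[a-z']+", piece.lower())
        for i, word in enumerate(words):
            if word == node:
                found.extend(words[max(0, i - span):i] + words[i + 1:i + 1 + span])
    return found


text = "I wrote the letter. Then I posted it"
print(collocates(text, "letter"))                          # stops at the sentence break
print(collocates(text, "letter", stop_at_sentence=False))  # now reaches posted
print(collocates(text, "wrote", stop_at_sentence=False))
```

With the sentence break respected, posted never gets considered as a collocate of letter; with no limits it does.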
Don't forget to save your results afterwards!

See also Collocates, Mutual Information Settings, Mutual Information Display, Detailed Consistency Relations, Making an Index List, Viewing Index Lists, Recompute Token Count, WordList Help Contents.

9.11 recompute tokens

Why recompute the tokens?

To compute relations such as Mutual Information or Keyness we need an estimate of the total number of running words (let's call it TNR) in the text corpus from which the data came. It is
</p> <p><span class="badge badge-info text-white mr-2">313</span>
tricky to decide what actually counts as the TNR. Not only are there problems to do with hyphenation, numbers, apostrophes and other non-letters in the middle of a word, words cut out because of a stoplist etc, but also a decision whether TNR should in principle include all of those or in principle include only the words or clusters now in the list in question. In practice for single-word word lists this usually makes little difference. In the case of word clusters, however, there might be a big difference between the TNR words and TNR clusters, and anyway what exactly is meant by running clusters of words if you think about how they are computed?

For most normal purposes, the total number of running words (tokens) computed when the word list or index was created will be used for these statistical calculations.

How to do it

Compute | Tokens

What it affects

Any decision made here will apply equally both to the node and the collocate whether these are clusters or single words, or to the little word-list and the reference corpus word-list in the case of key words calculations.

If you do choose to recompute the token count, then the TNR will be calculated as the total of the word or cluster frequencies for those entries still left in the list.
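In other words, the recomputed TNR is just the sum of the frequencies still showing. A tiny sketch, with an invented function name and invented figures, shows how far it can drift from the original count:

```python
def recompute_token_count(entries, min_frequency=1):
    """Total the frequencies of the entries still left in the list."""
    return sum(freq for freq in entries.values() if freq >= min_frequency)


# Hypothetical list: the texts originally had 70 running words,
# but a stop list has already removed some entries.
original_tokens = 70
entries = {"THE": 60, "CAT": 3, "SAT": 1}

print(recompute_token_count(entries))                   # total of what is left
print(recompute_token_count(entries, min_frequency=2))  # after a minimum frequency of 2
print(original_tokens)                                  # used if you do not recompute
```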
After any have been zapped or if a minimum frequency above 1 is used the difference may be quite large.

If you choose not to recompute, the total number of running words (tokens) computed when the word list or index was created will be used.

9.12 statistics

9.12.1 statistics

Visible by clicking the Statistics tab at the bottom of a WordList window:
</p> <p><span class="badge badge-info text-white mr-2">314</span>
Overall results take up the top row. Details for the individual text files follow below.

Statistics include:

· number of files involved in the word-list
· file size (in bytes, i.e. characters)
</p> <p><span class="badge badge-info text-white mr-2">315</span>
· running words in the text (tokens)
· tokens used in the list (would be affected by using a stoplist or changes to minimum settings)
· sum of entries: choose Compute | Tokens to see, otherwise this will be blank
· no. of different words (types)
· type/token ratios
· no. of sentences in the text
· mean sentence length (in words)
· standard deviation of sentence length (in words)
· no. of paragraphs in the text
· mean paragraph length (in words)
· standard deviation of paragraph length (in words)
· no. of headings in the text (none here because WordSmith didn't know how to recognise headings)
· mean heading length (in words)
· no. of sections in the text (here 480 because WordSmith only noticed 1 section per text)
· mean section length (in words)
· standard deviation of heading length (in words)
· numbers removed
· stoplist tokens and types removed
· the number of 1-letter words
· ...
· the number of n-letter words (to see these scroll the grid horizontally)

(14 is the default maximum word length. But you can set it to any length up to 50 letters in Word List Settings, in the Settings menu.) Longer words are cut short but this is indicated with a + at the end of the word.
The number of types (different words) is computed separately for each text. Therefore if you have done a single word-list involving more than one text, summing the number of types for each text will not give the same total as the number of types over the whole collection.

Vertical layout

If you prefer the layout in previous versions of WordSmith, you can choose to save the statistics vertically in a text file.
</p> <p><span class="badge badge-info text-white mr-2">316</span>
This lets you choose which ones (any unchecked are zero in the data):

and the data will be saved listed vertically. Alternatively you could export the data here to Excel and use its Transpose function to get the rows and columns swapped.
</p> <p><span class="badge badge-info text-white mr-2">317</span>
Tokens used for word list

In these data, there were over 2.8 million running words of text, but 38,943 numbers were not listed separately, so the number of tokens in the word-list is a little under 2.8 million.

MS Word's word count is different!

The number of tokens found is affected by your settings such as treatment of numbers, hyphens and mid-word letter settings (e.g. the apostrophe). For that reason you may well find that different programs give different values for the same text. (Besides, in the case of MS Word we are not told how a "word" is defined...)

Sum of entries can be computed after the word-list is created by choosing Compute | Tokens and will show the total number of tokens now available by adding the frequencies of each entry (you may have deleted some).

See also: WordList display (with a screenshot), Summary Statistics, Starts and Ends of Text Segments, Recomputing tokens.
</p> <p><span class="badge badge-info text-white mr-2">318</span>
9.12.2 type/token ratios
If a text is 1,000 words long, it is said to have 1,000 "tokens". But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words.

The ratio between types and tokens in this example would be 40%.

But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- which is being studied. A 1,000 word article might have a TTR of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (e.g. the LOB and Brown corpora). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths and the conventional TTR will not help you much.

WordList offers a better strategy as well: the standardised type/token ratio (STTR) is computed every n words as WordList goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)

Setting the N boundary

Adjust the n number in Minimum & Maximum Settings to any number between 100 and 20,000.
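The difference between the plain TTR and the standardised version is easiest to see in a sketch. The functions below follow the description above; the names are invented, and the decision to ignore a final chunk shorter than n is an assumption about how the leftover words are treated.

```python
def ttr(tokens):
    """Type/token ratio as a percentage: different words divided by running words."""
    return 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0


def sttr(tokens, n=1000):
    """Average the TTR of each consecutive chunk of n tokens; 0 if the text is too short."""
    # Assumption: a final chunk shorter than n is left out of the average.
    ratios = [ttr(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, n)]
    return sum(ratios) / len(ratios) if ratios else 0.0


# A toy 10-word "text", standardised on chunks of 4 words.
tokens = "a b c d a a a a e f".split()
print(ttr(tokens))         # 6 types out of 10 tokens
print(sttr(tokens, n=4))   # mean of the first two 4-word chunks
print(sttr(tokens, n=20))  # text shorter than n
```

Because every chunk has the same length, the average can be compared across texts of different sizes, which the plain TTR cannot.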
What STTR actually counts
Note: The ratio is computed
a) counting every different form as a word (so say and says are two types)
b) using only the words which are not in a stop-list
c) those which are within the length you have specified,
d) taking your preferences about numbers and hyphens into account.

The number shown is a percentage of new types for every n tokens. That way you can compare type/token ratios across texts of differing lengths. This method contrasts with that of Tuldava (1995:131-50) who relies on a notion of 3 stages of accumulation. The WordSmith method of computing STTR was my own invention but parallels one of the methods devised by the mathematician David Malvern working with Brian Richards (University of Reading).

Further discussion
TTR and STTR are both pretty crude measures even if they are often assumed to imply something about "lexical density". Suppose you had a text which spent 1,000 words discussing ELEPHANT, LION, TIGER, etc., then 1,000 discussing MADONNA, ELVIS etc, and then 1,000 discussing CLOUD, RAIN, SUNSHINE. If you set the STTR boundary at 1,000 and happened to get say 48% or so for each section, the statistic in itself would not tell you there was a change involving Africa, Music, Weather. Suppose the boundary between Africa & Music came at word 650 instead of at word 1,000, I guess there'd be little or no difference in the statistic.

But what would make a difference? A text which discussed clouds and written by a person who distinguished a lot between types of cloud might also use MIST, FOG, CUMULUS, CUMULO-NIMBUS. This would be higher in STTR than one written by a child who kept referring to CLOUD but used adjectives like HIGH, LOW, HEAVY, DARK, THIN, VERY THIN to describe the clouds... and who repeated DARK, THIN, etc a lot in describing them...
</p> <p><span class="badge badge-info text-white mr-2">319</span>
(NB.
Shakespeare is well known to have used a rather limited vocabulary in terms of measures like these!)

9.12.3 summary statistics
A word list's statistics give you data about the corpus, but you may need more specific information about individual words in a word list too. How many end in -ly? Press Count to get something like this:
</p> <p><span class="badge badge-info text-white mr-2">320</span>
There is no limit on the searches:

Cumulative Column
A cumulative count adds up scores on another column of data apart from the one you are processing for your search. The columns in this window are for numerical data only. Select one and ensure activated is ticked.
</p> <p><span class="badge badge-info text-white mr-2">321</span>
In this example, a word-list was computed and a search was made of 4 common word endings (and one ridiculous one). For -LY there are 2,084 types, with a total of 41,886 tokens in this corpus. -ITY and -NESS are found at the ends of fairly similar numbers of word-types, but -ITY has many more tokens in these data.

Breakdown
See the example for Concord.

Load button
See the explanation for count data frequencies.

9.13 stop-lists and match-lists
In WordList, a stop list is used in order to filter out some words, usually high-frequency words, that you want excluded from your word-list.
</p> <p><span class="badge badge-info text-white mr-2">322</span>
The idea of a match-list is to be able to compare all the words in your word list with another list in a plain text file and then do one of a variety of operations such as deleting the words which match, deleting those which don't, or just marking the ones in the list.

For both, you can define your own lists and save them in plain text files.
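To make the three match-list operations concrete, here is a small Python sketch. It only illustrates the idea and is not how WordSmith implements it: the word list is modelled as a dictionary of word to frequency, the match list as a set of upper-case words, and all function names (including the hypothetical load_plain_list helper, which assumes one word per line) are invented for the example.

```python
def load_plain_list(path, encoding="utf-8"):
    """Read a plain text list, one word per line (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        return {line.strip().upper() for line in handle if line.strip()}


def delete_matching(word_freqs, match_list):
    """Drop the words which are in the list (what a stop list does)."""
    return {w: f for w, f in word_freqs.items() if w.upper() not in match_list}


def delete_non_matching(word_freqs, match_list):
    """Keep only the words which are in the list."""
    return {w: f for w, f in word_freqs.items() if w.upper() in match_list}


def mark_matching(word_freqs, match_list):
    """Keep every word, adding a True/False flag for membership of the list."""
    return {w: (f, w.upper() in match_list) for w, f in word_freqs.items()}
```

For instance, with the word list {"THE": 100, "OF": 50, "ELEPHANT": 3} and the list {"THE", "OF"}, delete_matching leaves only ELEPHANT, delete_non_matching leaves THE and OF, and mark_matching keeps all three with a flag.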
Settings are accessed via the WordList menu or by an Advanced Settings button in the Controller.

See also: lemma lists, general explanation of stop-lists

9.14 import words from text list

the point of it
You might want a word list based on some data you have obtained in the form of a list, but whose original texts you do not have access to.

requirements
Your text file can be in any language (select this before you make the list), and can be in Unicode or ASCII or ANSI, plain text.
<Tab> characters are expected to separate the columns of data.
Decimal points and commas will be ignored.
Words will have leading or trailing spaces trimmed off.
The words do not need to be in frequency or alphabetical order.
You need at least a column with words and another with a number representing each word's frequency.
</p> <p><span class="badge badge-info text-white mr-2">323</span>
example
; My word list for test purposes.
THIS        67,543
IT          33,218
WILL        2,978
BE          5,679
COMPLETE    45
AND         99,345
UTTER       54
RUBBISH     99
IS          55,678
THE         678,965

You should get results like these.

Statistics are calculated in the simplest possible way: the word-lengths (plus mean and standard deviation), and the number of types and tokens. Most procedures need to know the total number of running words (tokens) and the number of different word types so you should manage to use the word-list in KeyWords etc.

The total is computed by adding the frequencies of each word-type (67543+33218+2978 etc. in the example above).

Optionally, a line can start \TOTAL=\ and contain a numerical total, eg.
\TOTAL=\ 299981
In this case the total number of tokens will be assumed to be 299981, instead.

how to do it
When you choose the New menu option in WordList you get a window offering three tabs: a Main tab for most usual purposes,
</p> <p><span class="badge badge-info text-white mr-2">324</span>
one for Detailed Consistency, and another (Advanced) for creating a word list using a plain text file.

Set the word column and frequency column appropriately according to the tabs in each line. (Column 1 assumes that the word comes first before any tabs; in the case of CREA's Spanish word-list there is a column for ranking so the word and frequency columns would need to be 2 and 3 respectively.)

Choose your .txt file(s) and a suitable folder to save to, add any notes you wish, and press create word list(s) now.

9.15 settings
</p> <p><span class="badge badge-info text-white mr-2">325</span>
9.15.1 WordSmith controller: Index settings

Index File
The filename is for a default index which you wish to consider the usual one to open.

thorough concordancing: when you compute a concordance from an index, you will either get (if thorough is checked) or not get (if not checked) full sentence, paragraph and other statistics as in a normal concordance search. (Computing these statistics takes a little longer.)

show if frequency at least: determines which items you will see when you load up the index file. (What you see looks like a word list but it is reading the underlying index.)

Clusters
The minimum and maximum sizes are 2 and 8. Set these before you compute a multi-word word list based on the index. A good maximum is probably 5 or 6.

stop at: you can choose where you want cluster breaks to be assumed. With the setting above (no limits), "I wrote the letter. Then I posted it" would consider letter then I posted as a possible multi-word string even though there's a sentence break between them.

Relationships
See relationships computing.
</p> <p><span class="badge badge-info text-white mr-2">326</span>
9.15.2 WordSmith controller: WordList settings
These are found in the main Controller marked WordList.

This is because some of the choices -- e.g. Minimum & Maximum Settings -- may affect other Tools.

There are 2 sets: What you Get and What you See.

WHAT YOU GET

Word Length & Frequencies
See Minimum & Maximum Settings.

Standardised Type/Token #
See WordList Type/Token Information.
</p> <p><span class="badge badge-info text-white mr-2">327</span>
Detailed Consistency
Min. frequency overall = a total frequency for a word to be included in the Detailed Consistency list.
Min. texts = the minimum number of texts that word must appear in.

WHAT YOU SEE

Tags
By default you get "words only, no tags". If you want to include tags in a word list, you need to set up a Tag File first. Then choose one of the options here.

In the example here we see that BECAUSE is classified by the BNC either as a <w CJS> or a <w PRP>. (That's how the BNC classifies BECAUSE OF ...)
</p> <p><span class="badge badge-info text-white mr-2">328</span>
For colours and tags see WordList and Tags.

Columns
The Columns to show/hide list offers all the standard columns: you may uncheck ones you normally do not wish to see. This will only affect newly computed KeyWords data: earlier data uses the column visibility, size, colours etc already saved. They can be altered using the Layout menu option at any time.

Case Sensitivity
Normally, you'll make a case-insensitive word list. If you wish to make a word list which distinguishes between the, The and THE, activate case sensitivity.

Lemma Visibility
By default in a word-list you'll see the frequency of the headword plus the associated forms; if you check the show headword frequency only box, the frequency column will ignore the associated wordform frequencies. Similarly, if you check omit headword from lemma column you will see only the associated forms there.
</p> <p><span class="badge badge-info text-white mr-2">329</span>
See also: Using Index Lists, Viewing Index Lists, WordList Help Contents, WordList and tags, Computing word list clusters.

9.15.3 minimum & maximum settings
These include:

minimum word length
Default: 1 letter. When making a word-list, you can specify a minimum word length, e.g. so as to cut out all words of less than 3 letters.

maximum word length
Default: 49 letters. You can allow for words of up to 50 characters in length. If a word exceeds the limit and Abbreviate with + is checked, WordList will append a + symbol at the end of it to show that it was cut short. (If Abbreviate with + is not checked, the long word will be omitted from your word list.) You might wish to use this to set both minimum and maximum to say, 4, and leave Abbreviate with + un-checked -- that way you'll get a word-list with only the 4-letter words in it.

minimum frequency
Default: 1. By default, all words will be stored, even those which occur once only. If you want only the more frequent words, set this to any number up to 32,000.

maximum frequency
Default maximum is 2,147,483,647 (2 Gigabytes). You'd have to analyse a lot of text to get a word which occurred as frequently as that! You might set this to say 500, and the minimum to 50: that way your word-list would hold only the moderately common words.

type/token mean number (default 1,000)
Enables a smoothed calculation of type/token ratio for word lists. Choose a number between 10 and 20,000.
For a more complete explanation, see WordList Type/Token Information.

See also: Text Characteristics, Setting Defaults, Stop Lists

9.15.4 case sensitivity
Normally, you'll make a case-insensitive word list, especially as in most languages capital letters are used not only to distinguish proper nouns but also to signal beginnings of sentences, headings, etc. If, however, you wish to make a word list which distinguishes between major, Major and MAJOR, activate case sensitivity (in the Controller WordList Settings | Case Sensitivity).

When you first see your case-sensitive list, it is likely to appear all in UPPER CASE. Press Ctrl+L or choose the Layout menu option to change this.
</p> <p><span class="badge badge-info text-white mr-2">330</span>
9.16 sorting

How to do it...
Sorting can be done simply by pressing the top row of any list. Press again to toggle between ascending & descending sorts.

With a word-list on your screen, the main Frequency window doesn't sort, but you can re-sort the Alphabetical window (look at the tabs at the bottom of WordList to choose the tab) in a number of different ways. The menu offers various options.

Alphabetical Word Sort
Many languages have their own special sorting order, so prior to sorting or re-sorting, check that you have selected the right language for the words being sorted. Spanish, for example, uses this order: A,B,C,CH,D,E,F,G,H,I,J,K,L,LL,M,N,Ñ,O,P,Q,R,S,T,U,V,W,X,Y,Z.

KeyWords and other comparisons require an alphabetically-ordered list in ascending order. If you get problems, please open the word lists in WordList, choose the "alphabetical" tab, sort by pressing the "Word" header until the sort is definitely alphabetical ascending, then choose the Save menu option.

Reverse Word Sort (Ctrl+F6)
This is so that you can sort words by suffix. The order is determined by word endings, not word beginnings. You will therefore find all the -ing forms together.

Word Length Sort (Shift+Ctrl+F6)
This is so that you can sort words by their length (1-letter, 2-letter, etc up to 50-letter words). Within a set of equal-length words, there's a second, alphabetical sort.

Consistency Sort
Press the "Texts" header to re-sort the words according to their consistency.

See also: Concord sort, KeyWords sort, Editing entries, Accented characters, Choosing Language

9.17 WordList and tags
If you have defined a tag file and made the appropriate settings to load it, you can get a word-list which treats tags and words separately as in this example, where the tag is viewed as if it were a prefix.

A word list only of tags?
Choose whether you want only the tags, only the words or both in WordList settings | What you see | Tags:
</p> <p><span class="badge badge-info text-white mr-2">331</span>
In its Alphabetical view, the list can be sorted on the tag or the word.

To colour these as in the example, in the main Controller I chose Blue for the foreground for tags (as the default is a light grey).
</p> <p><span class="badge badge-info text-white mr-2">332</span>
Then in WordList, I chose View | Layout as in this screenshot, selected the Word column header and chose green below.
</p> <p><span class="badge badge-info text-white mr-2">333</span>
9.18 WordList display
Each WordList display shows
· the word
· its frequency
· its frequency as a percent of the running words in the text(s) the word list was made from
· the number of texts each word appeared in
· that number as a percentage of the whole corpus of texts

The Frequency display might look like this:

Here you see the top 7 word-types in a word list based on 480 texts. There are 72,028 occurrences of these words (tokens) altogether but in the screenshot we can only see the first few. The Freq.
column shows how often each word cropped up (THE looks as if it appeared 72,010 times in the 480 texts), and the % column tells us that the frequency represents 6.07% of the running words in those texts. The Texts column shows that THE comes in 480 texts, that is 100% of the texts used for the word list.

If we pull the Freq. column a little wider (cursor at the header edge then pull right) so that the 72,010 doesn't have any purple marks beside it, we see the true frequency value is actually 172,010.
</p> <p><span class="badge badge-info text-white mr-2">334</span>
Another thing to note is that there seems to be a word #, with over 50 thousand occurrences. That represents a number or any word with a number in it such as EX658.

The Alphabetical listing also shows us some of the words but now they're in alphabetical order. ABANDON comes 43 times altogether, and in 37 of the 480 texts (less than 8%). ABANDONED, on the other hand, not only comes more often (78 times) but also in more texts (14% of them).

Now let's examine the statistics.
</p> <p><span class="badge badge-info text-white mr-2">335</span>
In all 480 texts, there are 72,028 word types (as pointed out above). The total running words is 2,833,815. Each word is about 4.57 characters in length. There are 107,073 sentences altogether, on average 26.47 words in length. In the text of a00.txt, there are only 1,571 different word types and that interview is under 7,000 words in length. This is explained in more detail in the Statistics page.

Finally, here is a screenshot of the same word list sorted "reverse alphabetically". In the part which we can see, all the words end in -IC.
</p> <p><span class="badge badge-info text-white mr-2">336</span>
To do a reverse alphabetical sort, I had the Alphabetical window visible, then chose Edit | Other sorts | Reverse Word sort in the menu. To revert to an ordinary alphabetical sort, press F6.

See also: Consistency, Lemmatisation
</p> <p><span class="badge badge-info text-white mr-2">337</span>
Section X Utility Programs
</p> <p><span class="badge badge-info text-white mr-2">338</span>
10 Utility Programs
Besides the three main programs, there are more Tools that have arisen over the years; this Chapter explains them.

Character Profiler            lists characters used in your texts
CharGrams                     like WordList but for sequences of characters
Corpus Corruption Detector    to find anomalous texts
File Utilities                various utilities for managing files
File Viewer                   shows the innards of your text files
Minimal Pairs                 identifies similar words
Text Converter                prepares your corpora for different formats
Viewer and Aligner            shows translated texts
WSConcGram                    finds and shows concgrams

10.1 Convert Data from Previous Versions

10.1.1 Convert Data from Previous Versions
As WordSmith Tools develops, it has become necessary to store more data along with any given word-list, concordance etc. For example, data about which language(s) were selected for a concordance, notes now stored with every type of results file, etc. Therefore it has been necessary to supply a tool to convert data from the formats used in WS 1.0 to 3.0 (last millennium) to the new format for the current version.

This is the Data Converting tool.

If you try to open a file made with a previous version you should be offered a chance to convert it first.

Note: as WordSmith develops, its saved data may get more complex in format. A concordance saved by WordSmith 5.0 cannot be guaranteed to be readable by WordSmith 4.0 for that reason, and a 6.0 one may require version 6.0, etc.
</p> <p><span class="badge badge-info text-white mr-2">339</span>
10.2 WebGetter

10.2.1 overview

The point of it
The idea is to build up your own corpus of texts, by downloading web pages with the help of a search engine.

What you do
Just type a word or phrase, check the language, and press Download.

How it works
WebGetter visits the search engine you specify and downloads the first 1000 sources or so. Basically it uses the search engine just as you do yourself, getting a list of useful references. Then it sends out a robot to visit each web address and download the web page in each case (not from the search engine's cache but from the original web-site). Quite a few robots may be out there searching for you at once -- the advantage of this is that one slow download doesn't hold all the others up.

After downloading a web page, that WebGetter robot checks it meets your requirements (in Settings) and cleans up the resulting text. If the page is big enough, a file with a name very similar to the web address will be saved to your hard disk.

When it runs out of references, WebGetter re-visits the search engine and gets some more.

See also: Settings, Display, Limitations
</p> <p><span class="badge badge-info text-white mr-2">340</span>
10.2.2 settings

Language
Choose the language you require from the drop-down list.

Search Engine
The search engine box allows you to choose for example www.google.com.br for searches on Brazilian Portuguese or www.google.fr for French. That is a better guarantee of getting text in the language you require!

Folder and Time-out
</p> <p><span class="badge badge-info text-white mr-2">341</span>
· folder: where the texts are to be stored. By default it is the \wsmith5 folder stemming from your My Documents. The folder you specify will act as a root. That is, if you specify c:\temp and search for "besteirol", results will be stored in c:\temp\besteirol. If you do another search on say "WordSmith Tools", results for that will go into c:\temp\WordSmith Tools.
· timeout: the number of seconds after which the WebGetter robot stops trying a given webpage if there's no response. Suggested value: 50 seconds.

Requirements
· minimum file length (suggested 20Kbytes): the minimum size for each text file downloaded from the web. Small ones may just contain links to a couple of pictures and nothing much else.
· minimum words (suggested: 300): after each download, WebGetter goes through the downloaded text file counting the number of words and won't save unless there are enough.
· required words: you may optionally type in some words which you require to be present in each download; you can insist they all be present or any 1 of these.

Clean-up
If you want all the HTML markup removed, you can check this box, setting a suitable span between < and > markers, 1000 recommended.

Advanced Options
If you work in an environment with a "Proxy Server", WebGetter will recognise this automatically and use the proxy unless you uncheck the relevant box. If in doubt ask your network administrator. You can specify the whole search URL and terms string yourself if you like with a box in the Advanced options.

See also: Display, Limitations

10.2.3 display
As WebGetter works, it shows the URLs visited. If greyed out, they were too small to be of use or haven't been contacted yet.

There is a tab giving access to a list of the successfully downloaded files which will show something like this.
</p> <p><span class="badge badge-info text-white mr-2">342</span>
Double-click a file to view and, if you like, edit it in Notepad.

The URLs list looks like this
</p> <p><span class="badge badge-info text-white mr-2">343</span>
Just double-click a URL to view it in your browser.

See also: Settings, Limitations

10.2.4 limitations
Everything depends on the search engine and the search terms you use. The Internet is a huge noticeboard; lots of stuff on it is merely ads and catalogue prices etc. The search terms are collected by the search engines by examining terms inserted by the web page author. There is no guarantee that the web pages are really "about" the term you specify, though they should be roughly related in some way.

Use the Settings to be demanding about what you download, e.g. in requiring certain words or phrases to be present.

See also: Display
</p> <p><span class="badge badge-info text-white mr-2">344</span>
10.3 Corpus Corruption Detector

10.3.1 Aim
The purpose is to check whether one or more of the text files in your corpus doesn't belong. This could be because
· it has got corrupted so what used to be good text is now just random characters or has got cut much shorter because of disk problems
· it isn't even in the same language as the rest of the corpus

The tool works in any language. It does it by using a known sample of good text (in whatever language) and comparing that good text with all your corpus.

See also: How to do it

10.3.2 How it works
1. Choose a set of "known good text files" which you're sure of. The program uses these to evaluate the others. When you click the button for known good text files, you can choose a number. You might choose 20 good ones so as to get a lot of information about what your corpus is like.
2. Choose your corpus head folder and check the "include sub-folders" box if your corpus spreads over that folder and sub-folders.
3. The program will anyway look out for oddities such as a text file which has holes in it, eg. where the system thinks it's 1000 characters long but there are only 700.
4. If you check the "digraph check" box it will additionally check that the pairs of letters (digraphs) are of roughly the right frequency in each text file. For example there should be a lot of TH combinations if your text is in English, and no QF combinations. If you are working with a corpus in Portuguese and your text files are in Portuguese too, of course the digraphs will be different, and TH won't be frequent. The program ignores punctuation.
</p> <p><span class="badge badge-info text-white mr-2">345</span>
5. If you are doing a digraph check you can vary certain parameters such as how much variation there may be between the frequencies of the digraphs (a sensible setting for "frequency variation per 1000" could be 30 (in other words 3%)), and "percent fail allowed" (which might be set at say 25 -- this means that up to 25% of the digraph pairs may be out of balance before an alert is sounded).
6. Press Start. You will see the progress bar moving forward.

If you see a file-name in the top-left box, a click on it will indicate why it was found questionable. Double-clicking it will open up the text in the window below so you can examine it carefully. Filenames of possibly corrupted texts are yellow if the basic check fails, and cream-coloured if the reason is a digraph mis-match.

In the screenshot, PEN000884.txt is problematic because the file-size on disk is 2591 (there should be 2591 characters) but there are only 158, as shown in the statusbar at the bottom.
</p> <p><span class="badge badge-info text-white mr-2">346</span>
In the case of PEOP020151.txt, the text appears below (after double-clicking the list), and the status bar says the tool has found an imbalance in the digraphs.
The text itself has a lot of blank space at the top but otherwise looks OK (it is supposed to be in Spanish) but the detector has flagged it up as possibly defective.

10.4 Minimal Pairs

10.4.1 aim
A program for finding possible typos and pairs of words which are minimally different from each other (minimal pairs). For example, you may have a word list which contains ALEADY 5 and ALREADY 461, that is, your texts contain 5 instances where there is a possible misprint and 461 which are correct.

This program helps to find possible variants and typos and anagrams.

See also: requirements, choosing your files, output, rules and settings, running the program.
</p> <p><span class="badge badge-info text-white mr-2">347</span>
10.4.2 requirements
A word-list in text format. Each line should contain a word and its frequency separated by tabs.

You can make such a list using WordList. For example, select (highlight) the columns containing the word and its frequency and copy to the clipboard, then paste into Notepad, or save as TXT (without numbers or heading row):
</p> <p><span class="badge badge-info text-white mr-2">348</span>
giving this:

See also: aim, choosing your files, output, rules and settings, running the program.

10.4.3 choosing your files
Choose your input word list (which must be in plain text format) by clicking the button at the right of the edit space and finding the word list .txt file.
</p> <p><span class="badge badge-info text-white mr-2">349</span>
Type in an appropriate file-name for your results. Choose the rules too.

When you're ready, press the Compute button.

You'll then be asked to choose the columns and rows (allowing you to skip header lines or the number column if your txt file has those).
</p> <p><span class="badge badge-info text-white mr-2">350</span>
Here, the first three lines are greyed out, so we need to alter the Rows box:

See also: aim, requirements, output, rules and settings, running the program.
</p> <p><span class="badge badge-info text-white mr-2">351</span>
10.4.4 output
An example of output is

418 ALTHOUGHT (7) ALTHOUGH(37975)

Here the lines are numbered, and the bracketed numbers mean that ALTHOUGHT occurred 7 times and ALTHOUGH 37,975 times.

An example using Dutch medical text, lower case:

136 aplasie (1) aplasia(1)[L]
137 apyogene (1) apyogeen(1)[S]
138 arachnoideales (1) arachnoidales(1)[I]

Here line 136 generated a 1-Letter difference, 137 a Swap and 138 an Insertion.

An example using Guardian newspaper, looking for anagrams:

35 AUDIE (7) ADIEU(43)[A]
36 ABASS (6) ASSAB(16)[A]
37 AGUIAR (6) AURIGA(11)[A]
38 ALRED'S (6) ADLER'S(18)[A]
39 ANDOR (6) ADORN(128)[A]

An example where the alternatives are separated with commas but the rule and frequencies are not shown.

See also: aim, requirements, choosing your files, rules and settings, running the program.
</p> <p><span class="badge badge-info text-white mr-2">352</span>
10.4.5 rules and settings

Rules

Insertions (abxcd v. abcd)
This rule looks for 1 extra letter which may be inserted, e.g. HOWWEVER

Swapped letters (abcd v. acbd)
This rule looks for letters which have got swapped, e.g. HOVEWER

1 letter difference (abcd v. abxd)
This rule looks for a 1 letter difference, e.g. HOWEXER

Anagrams too (abcd v. adbc)
This rule looks for the same letters in a different order, e.g. HWVROEE

Settings:

end letters to ignore if at last letter:
This rule allows you to specify any letters to ignore if at the end of the word, e.g.
if you specify "s", the possibility of a typo when comparing ELEPHANT and ELEPHANTS will not be reported.

minimum word length
This setting specifies the minimum word length for the program to consider the possibility there is a typo. The default is 5, which means words of 4 letters or fewer will be simply ignored. This is to speed up processing, and because most typos probably occur in longer words.

letters to ignore at start of word
</p> <p><span class="badge badge-info text-white mr-2">353</span>
This setting (default = 1) allows you to assume that when looking for minimal pairs there is a part of each at the beginning which matches perfectly. For example, when considering ALEADY, the program probably doesn't need to look beyond words beginning with A for minimal pairs. If the setting is 1, it will not find BLEADY as a minimal pair. To check all words, take the setting down to 0. The program will be 26 times slower as a result!

only words starting with ...
If you choose this option, the program will ignore the next setting (max. word frequency). Here you can type in a sequence such as F,G,H and if so, the program will take all words beginning F or G or H (whatever their frequency) and look for minimal pairs based on the rules and settings above.

max. word frequency (ignored if "all words starting with" is checked)
How frequent can a typo be? This will depend on how much text your word-list is based on. The default is 10, which means that any word which appears 11 times or more is assumed to be OK, not a typo.

Factory Defaults (restores default values)

See also: aim, requirements, choosing your files, output, running the program.

10.4.6 running the program
· Press "Compute". You should then see your source text, with a few lines visible. Some of the rows and columns may be greyed and others white: move the column and row numbers till the real data are white and any headings or line-numbers are greyed out.
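The four rules listed under "rules and settings" can be sketched in Python as follows. This is only an illustration of the rules as described, not the program's actual algorithm: the function name classify_pair is invented, the rule letters I, S, L and A are taken from the output examples, and a "swap" is assumed here to mean any two letters exchanging places (as in HOVEWER), not only adjacent ones. The settings (end letters to ignore, minimum length, frequency limits) are not modelled.

```python
def classify_pair(a, b):
    """Return 'I', 'S', 'L' or 'A' if a and b are a minimal pair, else None.

    I = insertion, S = swapped letters, L = 1 letter difference, A = anagram.
    """
    if a == b:
        return None
    # Insertion: the longer word is the shorter one plus exactly one letter.
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return "I"
        return None
    if len(a) != len(b):
        return None
    # Positions at which the two equal-length words differ.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return "L"
    if len(diffs) == 2:
        i, j = diffs
        if a[i] == b[j] and a[j] == b[i]:
            return "S"
    # Anagram: same letters, different order.
    if sorted(a) == sorted(b):
        return "A"
    return None
```

Applied to the manual's own examples against HOWEVER, this gives I for HOWWEVER, S for HOVEWER, L for HOWEXER and A for HWVROEE; ALEADY against ALREADY is reported as I.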
</p> <p><span class="badge badge-info text-white mr-2">354</span> Here the first three lines are greyed out, and that can be fixed by changing Rows from 4 to 1.
Once you press OK the program starts:
</p> <p><span class="badge badge-info text-white mr-2">355</span> If you want to stop in the middle, press "Stop". When you have finished, you can press "Results" to see your results file.
See also: aim, requirements, choosing your files, output, rules and settings.

10.5 File Viewer
10.5.1 Using File Viewer
Aim
To help you examine files of various kinds to see what is in them. This might be in order
· to see whether they’re really in plain text format
· to see whether there's something wrong with them, such as unusual characters which oughtn't to be there
· to see whether they were formatted for Windows, Mac, or for Unix
</p> <p><span class="badge badge-info text-white mr-2">356</span> · to check out any hidden stuff in there. (A Word .doc for example will have lots of hidden stuff you don’t see on the screen but is inside the file anyway, such as the name of the author, type of printer being used, etc.)
· to find strings of words in a database, a spreadsheet or even a program file.
· to get certain selected characters picked out in an easy-to-find colour
Here you can see the gory details of the text. Some characters are highlighted in different colours so you can see exactly how the text is formatted.
Loading a text file
Choose your file – if necessary click on the button at the right of the text-input box. Press Show.
Characters
The two options available are 1 byte or 2 bytes to represent each character-symbol in the text in question. You may need to alter this setting to see your text in a readable format.
The two windows
The left window shows how the "text" is built up.
You can see each character as a number and, further to the right, as a character. The right window shows the text, paragraph by paragraph, word-wrapped so you can read it.
Searching
Just type in the search-word and press Search. The search is case sensitive and is not a "whole word" search.
Synch
Press the Synch button to synchronise the two displays. The display you clicked last is the "boss" for synchronising.
Settings
</p> <p><span class="badge badge-info text-white mr-2">357</span> Colours
The colour grids let you see the number section in special colours, so you can find the potential problems you’re interested in.
· First select the character you want coloured.
· Click the foreground or background colour list to change the colour.
The character names are Unicode names. In the picture the symbol with the 003E code (>) is the last one clicked.
Font
Choose the font and size in the font window. You may need to change font if you want to see Chinese etc. represented correctly.
Columns
o You can set the “hex” columns between 2 and 16.
o You can see the numbers at the left of the main window in hex or decimal.

10.6 File Utilities
10.6.1 index
</p> <p><span class="badge badge-info text-white mr-2">358</span> This sub-program supplies a few file utilities for general use:
Compare Two Files
File Chunker
Find Duplicates
Rename
Find Holes: for "holes" in text files
Splitter
Joiner
Move files to sub-folder

10.6.2 Splitter
Splitter: index
Explanations
What is the Splitter utility and what's it for?
Filenames
Wildcards
See also: WordSmith Main Index

aim of Splitter
This is a sub-program for splitting large files into lots of small ones.
Splitter needs to know:
Start/End of Section Separator
The symbol which will act as a start or end-of-text separator: e.g. </Text> or [FF] or <end of story> or !# or CHAPTER # or [FF*] or [FF?????]
Restrictions:
1 The start/end-of-text marker must occur at the beginning of a line in the original large file.
2 It is case sensitive: </Text> will not find </text>.
3 The first character in the separator may not be a wildcard such as #, * or ?.
4 # and * may occur only once each in the separator.
Splitter will create a new file every time it encounters the start/end-of-text marker you've specified. The end of text box determines whether the line containing the separator gets included in the previous or new text file.
Destination Folder
Where you want the small files to be copied to. (You'll need write permission to access it if on a network.)
Required sizes
The minimum and maximum number of lines that your small files can have (default = 5 and
</p> <p><span class="badge badge-info text-white mr-2">359</span> 30,000). Only files within these limits will be saved. This feature is useful for extracting files from very large CD-ROM files. A "line" means from one <Enter> to the next.
Bracket first line
Whether or not you want the first line of each new text file to be bracketed inside < > marks. (If your separator is a start-of-section separator like CHAPTER with a number, you may wish that to be in brackets. And often the first line after an end-of-text symbol will contain some kind of header.) If you don't want it to insert < and > around the line, leave this box unchecked.
Title Line
If you know that a given line of your texts always contains the title for the sub-text in question, set this counter to that number, otherwise leave it at 0. For example, where you know that every line immediately following <end of story> has a title for the next story, you could put 1.
Example:
...
<end of story>
Visiting New York
...
The file-name created for each story will contain the title as well as a suitable number. In this
example a file-name might end up as C:\texts\split\Visiting New York 0004.txt
See also: Joiner, Filenames, Wildcards, The buttons, Text Converter index.

Splitter: filenames
Splitter will create lots of small files based on your large one(s).
</p> <p><span class="badge badge-info text-white mr-2">360</span> It creates filenames as sub-files of a folder based on the name of each text file. In this screenshot, it has found a file called C:\temp\G_O\The Observer\2002\Home_news\Apr 07.txt and is creating a set of results listed 1 to 11 or more, using the specified destination folder plus the same folder structure as the original texts.
Each sub-text is numbered 0001.txt, 0002.txt etc. Sub-folders are created if there are too many files for a folder.
If a title is detected, each file will contain the title plus a number and .txt. If there is no title, the filename will be the number + .txt added as a file extension.
Tips
1. Splitter will start numbering at 1 each session.
2. Note that the small files will probably take up a lot more room than the original large file did. This is because the disk operating system has a fixed minimum file size. A one-character text file will require this minimum size, which will probably be several thousand bytes in size. Even so, I suggest you keep your text files such that each file is a separate text, by using Splitter. When doing word lists and key words lists, though, do them in batches.
3. CD-ROM files when copied to your hard disk may be read-only. You can change this attribute using Text Converter.

Splitter: wildcards
The hash symbol, #, is used as a wildcard to represent any number, so [FF#] would find [FF3] or [FF9987] but not [FF] or [FF 9] (because there's a space in it) or [FFhello].
The asterisk * represents any string, so [FF* would find all of the above. * is used as the last character in the end-of-text symbol.
It would find [FF anything at all up to the next <Enter>.
</p> <p><span class="badge badge-info text-white mr-2">361</span> The ^ mark represents any single letter, so [FF^^] would find [FFZQ] but none of the others.
The question mark ? represents any single character (including spaces, punctuation, letters), so [FF??] would find [FF 9] in the above examples, but none of the others.
To represent a genuine #, ^, ? or *, put each one in double quotes, e.g. "?" "#" "^" "*".
See also: Settings, Wildcards

10.6.3 join text files
This is a sub-program for joining small text files into bigger ones. You might want this because you aren't interested in the different texts individually but are only interested in studying the patterns of a whole lot of texts.
When you choose Joiner you will see something like this:
End of text marker
The symbol which will act as an end-of-text separator: e.g. [FF] or <end of story> or </Text> or !# or [FF*] or [FF?????]. The end-of-text marker will come at the beginning of a line in the original large file. If it includes # this will be replaced by the number of the text as the texts are processed.
Folder with files to join
Where the small files you want to be merged are now. They will not get deleted; you must merge them into the Destination folder.
</p> <p><span class="badge badge-info text-white mr-2">362</span> and sub-folders too
Check this if you want to process sub-folders of the "folder with files to join".
file specifications
The kinds of text files you want to merge, e.g. *.* or *.txt or *.txt;*.ctx.
Destination Folder
Where you want the small files to be copied and merged to. (You'll need write permission to access it if on a network.)
recreate same sub-folders as source
If checked, creates the same structure as in the source. In the example, all the sub-folders of d:\text\guardian_cleaned will be created below d:\text\guardian_joined.
one text for each folderful
If checked, a whole folderful of source texts will go into one text file in the destination.
Max. size (Kbytes)
The maximum size in kilobytes that you want each merged text file to be. 1000 means you will get almost 1 megabyte of text into each. That is about 150,000 words if there are no tags and the text is in English. This only applies if one text for each folderful isn't checked.
Stop button
Does what it says on the caption.
See also: Splitter, Text Converter index.

10.6.4 compare two files
The point of it
The idea is to be able to check whether 2 files are similar or not. You may often make copies of files and a few weeks later cannot remember what they were. Or you have used File Chunker to copy a big file to floppies and want to be sure the copy is identical to the original.
This program checks whether
a) they are the same size
b) they have the same contents (it goes through both, byte by byte, checking whether they match)
c) they have the same attributes (file attributes can be "read only" [you cannot alter the file], "system" [a file which Windows thinks is central to your operating system], "hidden" [one which is so important that Bill Gates may be reluctant to even let you know it exists on your disk])
d) they have the same time & date.
</p> <p><span class="badge badge-info text-white mr-2">363</span> How to do it
Specify your 2 files and simply press "Compare".
See also: file chunker, find duplicates, rename

10.6.5 file chunker
The point of it
The idea is to be able to cut up a big file into pieces, so that you can copy it in chunks e.g. for emailing. Naturally, you may later want to restore the chunks to one file.
How to do it: to copy a file
1. Specify your "file to chunk" (the big one you want to copy)
2. Specify your "drive & folder" (where you want to copy the chunks to)
3. Specify the "size of each chunk"
4.
Specify whether to "compress while chunking" (compresses the file as it goes along)
5. Press "Copy".
How to do it: to restore a file
1. Specify your "first chunk" (the first chunk you made using this program)
2. Specify which folder to "restore to" (where you want the results)
3. Specify whether to "delete chunks afterwards" (if they are not needed)
4. Press "Restore".
See also: compare two files, find duplicates, rename

10.6.6 find duplicates
The point of it
The idea is to be able to check whether you have files with the same name in different folders. You may often make copies of files and a few weeks later cannot remember where they were. By default this program only checks whether the files it is comparing have the same name but dates and file-size can be compared too. It handles lots of folders, the point being to locate unnecessarily duplicated files or confusing reuse of the same filenames.
How to do it
Specify your Folder 1 and simply press "Search". Find Duplicates will go through that folder and any sub-folders and will report any duplicates found.
</p> <p><span class="badge badge-info text-white mr-2">364</span> Or you can specify 2 different folders (e.g. on different drives) and the process compares one set with the other.
Sub-folders to exclude
Useful if there are some sub-folders you know you're not interested in. In the example below, any folder whose name ends _old or _shibs or whose name is demo or examples will be ignored, as will any sub-folder below it.
In the window below, you will find all the duplicates listed with the folder and date. In the example we can see there are two files called ambassador 1.txt in different shakespeare folders.
See also: compare two files, file chunker, rename

10.6.7 rename
The point of it
To rename a lot of files at once, in one or more folders. You may have files with excessively long names which do not suit certain applications.
Or it is a pain to rename a lot of files one by one.
</p> <p><span class="badge badge-info text-white mr-2">365</span> The idea is to rename a set of files with a standard name plus a number. For example, suppose you have downloaded a lot of text files containing emails from the Enron scandal: you could rename them Enron001.txt, Enron002.txt etc.
How to do it
Specify your Folder, whether sub-folders will also be processed, and the kinds of file you want to find for renaming.
In the screenshot, *.txt;*.xml has been specified, which means all .txt files and all .xml files. Find Files has been pressed, too. In the list you can see some of each.
If you typed baby??.doll you'd get all files with the .doll ending as long as the first 4 characters were baby, as in baby05.doll, babyyz.doll, etc.
Now specify a "mask for new name" and a starting number. The mask can end with a series of # characters standing for numbers. In this screenshot, there are 4 # symbols
</p> <p><span class="badge badge-info text-white mr-2">366</span> so after pressing Rename the texts have been renamed Bacon plus an incrementing number formatted to 4 digits.
See also: compare two files, file chunker, find duplicates

10.6.8 move files to sub-folders
This function allows you to take a whole set of files in a folder and move them to suitable sub-folders.
Example: you have
In c:\temp
2001 Jan.txt
2001 Feb.txt
2003 Jan.txt
2003 Feb.txt
2003 March.txt
2003 Oct.txt
etc.
and you want them sorted by year into different folders.
Using the template AAAA* you will take the first four characters of your files and place each into a sub-folder named appropriately.
Results
c:\temp\2001 contains 2001 Jan.txt, 2001 Feb.txt
</p> <p><span class="badge badge-info text-white mr-2">367</span> and all the others are in c:\temp\2003
Syntax
? = ignore this character
A = use this character in the file-name
* = use no further characters in the file-name

10.6.9 dates and times
Purpose
The aim here is to parse your file-names identifying suitable textual file dates and times, where you have incorporated something suitable in the file-name. Suitable dates can be re-used by saving file-choices as favourites.
Mask Syntax
The procedure reads any file-names in the Folder to process (and optionally its sub-folders) and attempts to parse them. If an indicator is found it will record a suitable date combination.
Suitable indicators of textual date are
YY or YYYY year, two or four digits (YY = a 20th Century date)
</p> <p><span class="badge badge-info text-white mr-2">368</span> MM month
DD day
* skip all characters until a digit is found
The procedure doesn't understand words such as "December" or "Five", it only uses digits. Any character other than Y,M,D,* in the mask simply gets ignored.
Output
The program will always add each entry found to a simple text file (File for list of dates) listing its file-name and adding a suitable date as expected in the auto-date procedure (or <no date found> if the mask didn't match a valid date).
In addition, where the result is 1st January 1980 or later, it will set the file's time and date in the operating system to the date as parsed, so that WordSmith will automatically match the date of the text contents to the date stored on disk.
When all files have been processed, the program opens the list of files in Notepad or equivalent. Use it afterwards in the auto-date procedure within file-choosing and save your preferred text files as favourites.
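To make the mask syntax concrete, here is a minimal sketch of how a mask might be applied to a file-name. It is an illustration under stated assumptions, not the program's actual procedure: the function name parse_date_mask is hypothetical; a mask character other than Y, M, D or * is assumed to skip one character of the file-name (which is how the # and space characters behave in the manual's examples); parsing is assumed to stop at the first non-digit found where a digit was expected; and parts that could not be read are returned as None rather than being given a default date. No check is made that the month and day are valid.

```python
from typing import Optional, Tuple

DIGITS = "0123456789"


def parse_date_mask(filename: str, mask: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (year, month, day) read from filename according to mask."""
    found = {"Y": "", "M": "", "D": ""}
    pos = 0
    for m in mask:
        if m == "*":
            # Skip all characters until a digit is found.
            while pos < len(filename) and filename[pos] not in DIGITS:
                pos += 1
            continue
        if pos >= len(filename):
            break  # the file-name is shorter than the mask
        if m in found:
            if filename[pos] not in DIGITS:
                break  # expected a digit here; keep what was read so far
            found[m] += filename[pos]
        pos += 1  # any other mask character just skips one character
    year = int(found["Y"]) if found["Y"] else None
    if year is not None and mask.count("Y") == 2:
        year += 1900  # a YY mask is taken as a 20th Century date
    month = int(found["M"]) if found["M"] else None
    day = int(found["D"]) if found["D"] else None
    return year, month, day
```

For instance, under these assumptions parse_date_mask("Peter 20060512.txt", "######YYYYMMDD") gives (2006, 5, 12), matching the corresponding row of the manual's examples.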
Examples
Source file                      Your Mask         Date and Time interpreted
20060512 Peter monologue.txt     YYYYMMDD          12th May 2006 (first 8 characters used in the mask)
841231.txt                       YYMMDD            31st December 1984 (20th Century assumed if YY mask used)
311284.txt                       DDMMYY            31st December 1984
20060512 Peter monologue.txt     DDMMYYYY          20th June, the year 512 AD
20060512 Peter monologue.txt     DDMM              20th June of the current year
Peter 20060512.txt               ######YYYYMMDD    12th May 2006 (first six characters were ignored, five for Peter, one for space)
Peter 20060512.txt               *YYYY             15 July 2006 (all characters to first digit skipped, then next 4 used for year date)
1086 Domesday book.txt           YYYY              15 July 1086 (there were only four digits)
1086 Domesday book.txt           YYYYMMDD          15 July 1086 (mask had 8 digits but file-name only 4)
2006,05/12,10-54.txt             YYYY#MM#DD        12th May 2006
2006,05/12,10-54.txt             YYYY MM DD        12th May 2006
</p> <p><span class="badge badge-info text-white mr-2">369</span>
10.6.10 find holes in texts
After text files have been copied from one source to another, they may get slightly corrupted with holes in the stream of text. This utility lets you seek out the texts in your corpus which have got corrupted in this way and optionally lets you delete them. If you want to convert the holes to space-characters, use the Text Converter.

10.7 Text Converter
10.7.1 purpose
This program does a "Search & Replace", on virtually any number of files.
It is very useful for going through large numbers of texts and re-formatting them as you prefer, e.g. taking out unnecessary spaces, ensuring only paragraphs have <Enter> at their ends, changing accented characters, ensuring you have Windows £ symbols, etc.
converting text
For a simple search-and-replace you can type in the search item and a replacement; for more complex conversions, use a Conversion File so that Text Converter knows which symbols or strings to convert.
It operates under Windows and saves using the Windows character set, but will convert text using DOS or Windows character sets. You can use it to make your text files suitable for use with your Internet browser.
It does a "search and replace" much as in word-processors, but it can do this on lots of text files, one after the other. As it does so, it can also replace any number of strings, not just one.
Once the conversion file is prepared and Settings specified, the Text Converter will read each source file and either create a new version or replace the old one, depending on the over-write setting.
You will be able to see the details of how many instances of each string were found and replaced overall.
filtering files
And/or you may need to make sure texts which meet certain criteria are put into the right folders.
Tip
The easiest way to ensure your text files are the way you want, especially if you have a very large number to convert, is to copy a few into a temporary folder and try out your conversion file with the Text Converter. You may find you've failed to specify some necessary conversions. Once you're sure everything is the way you want it, delete the temporary files.
</p> <p><span class="badge badge-info text-white mr-2">370</span> See also: Text Converter Contents, The buttons

10.7.2 index
Explanations
What is the Text Converter and what's it for?
Getting Started...
Convert the text format
Filters
Sample Conversion File
Syntax
Conversion File
See also: WordSmith Main Index

10.7.3 settings
1. Choose Files (the top left tab). Decide whether you want the program to process sub-folders of the one you choose. There is no limit to the number of files Text Converter can process in one operation.
2. Click on the Conversion or Filters tab, and:
3. Decide whether you want to make copies of the text files, or to over-write the originals.
Obviously you must be confident of the changes to choose to over-write; copying however may mean a problem of storage space.
Choose between "Within files", "Whole files" or "Extract from files"
Within files = make some alterations to specific words in each text file, if found
For example, specify what to convert, that is the search-words and what you want them to be replaced with. For a quick conversion you can simply type in a word you want to change and its replacement (e.g. Just one change so that responsable becomes responsible) or you can choose your own pre-prepared Conversion File.
Whole files = make some alterations affecting all the words in each text file
E.g. in the Whole Files section you can choose simply to update legacy files in various ways, e.g. by choosing Dos to Windows, Unix to Windows, MS Word .doc to .txt, into Unicode, etc.
Or if you want simply to extract some text from your files, you should choose the Extract from files
</p> <p><span class="badge badge-info text-white mr-2">371</span> tab.
If you might want some files not to be converted, or simply don't want any conversions but instead to place files in appropriate sub-folders, choose the Filters tab at the top.
If you choose Over-write Source texts, Text Converter will work more quickly and use less disk space, but of course you should be quite sure your conversion file codes are right before starting!
See copy to for details of how the folders get replicated in a copy operation.
Note that some space on your hard disk will be used even if you plan to over-write. The conversion process does its work, then if all is well the original file is deleted, and the new version copied. There has to be enough room in the destination folder for the largest of your new files; it is much quicker for it to be on the same drive as the source texts. If it isn't, your permission will be asked to use the same drive.
</p> <p><span class="badge badge-info text-white mr-2">372</span> inserting <Tab>, <Enter> etc
</p> <p><span class="badge badge-info text-white mr-2">373</span> Choose in the listbox and drag to one of the windows to left or right of ->. The string inserted will conform to the Conversion File format.
cutting out a header from each file
It can be useful to get a header removed. In the screenshot example, any text which contains </teiHeader> will get all the beginning of the file up to that point cut out.
Press OK to start; you will see a list of results, as in the screenshot below.
If you want to stop Text Converter at any time, click on the Stop button or press Escape.
Right-click to see the source or the converted result file:
See also: Text Converter Contents.

Text Converter: copy to
If you choose to copy the files you are converting, instead of converting or filtering them in place, which is a lot safer, the new files created will be structured like this.
Suppose you are processing d:\texts\2007\literature and copying to c:\temp
and suppose d:\texts\2007\literature contains this sort of thing:
d:\texts\2007\literature\shakespeare\hamlet.pdf
d:\texts\2007\literature\shakespeare\macbeth.pdf
...
d:\texts\2007\literature\shakespeare\poetry\sonnet1.pdf
d:\texts\2007\literature\shakespeare\poetry\sonnet2.pdf
...
d:\texts\2007\literature\french\victor hugo\miserables.pdf
d:\texts\2007\literature\french\poetry\baudelaire\le chat.pdf
...
you will get
c:\temp\shakespeare\hamlet.txt
c:\temp\shakespeare\macbeth.txt
...
</p> <p><span class="badge badge-info text-white mr-2">374</span> c:\temp\shakespeare\poetry\sonnet1.txt
c:\temp\shakespeare\poetry\sonnet2.txt
...
c:\temp\french\victor hugo\miserables.txt
c:\temp\french\poetry\baudelaire\le chat.txt
...
In other words, for each file successfully converted or filtered, any same directory structure beyond the starting point (d:\texts\2007\literature in the example above) will get appended to the destination.

10.7.4 extracting from files
The point of it...
The idea is to be able to extract something useful from within larger files. In the example below, I wanted to extract the headlines only from some newspaper text. I knew that the header for each text contained <DAT> (date of publication mark-up) and that the headline ended </HED>, and I wanted only those chunks which contained the phrase Leading article:.
The results I got looked like this:
<CHUNK "1"><DAT>05 August 2001</DAT> <SOU>The Observer</SOU> <PAG>26</PAG> <HED>Comment: Leading article: Ealing's lessons: Time for steel from the peacemakers</HED></CHUNK>
<CHUNK "2"><DAT>05 August 2001</DAT> <SOU>The Observer</SOU> <PAG>26</PAG> <HED>Comment: Leading article: The free market can't house us all: Why Government has to intervene</HED></CHUNK>
</p> <p><span class="badge badge-info text-white mr-2">375</span> <CHUNK "3"><DAT>05 August 2001</DAT> <SOU>The Observer</SOU> <PAG>26</PAG> <HED>Comment: Leading article: What a turn-on: Cat's whiskers are the bee's knees</HED></CHUNK>
Settings
containing: all non-blank lines in this box will be required. Leave it blank if you have no requirement that the chunk you want to extract contains any given word or phrase.
chunk marker: Leave blank, otherwise each chunk will be marked up as in the example above, if it begins with < and ends with >. The reason for this marker is to enable subsequent splitting.

10.7.5 filtering: move if
This function allows you to specify a word or phrase, look for it in each file, and if it's found move that file into a new folder.
The point of it ...
Suppose you have a whole set of files some of which contain dialogues between Pip and Magwich, others containing references to the Great Wall of China or the anatomy of fleas. You want those with the Pip-Magwich dialogues and you want them to go into a folder called Expectations.
How to do it
1. Click on the Filters tab (at the top).
2. Now check the Activated checkbox.
</p> <p><span class="badge badge-info text-white mr-2">376</span> 3. Specify a word or phrase the text must contain. This is case sensitive. In this case Magwich has been specified.
4. Choose whether that word or phrase has to be found
· anywhere in the text,
· anywhere before some other word or phrase, or
· between 2 different words or phrases.
5. Decide what happens if the conditions are met:
· nothing, i.e. ignore that text file
· copy to a certain folder, or
· move to that folder, or
· delete the file (careful!).
Action options
· You can also decide to build sub-folder(s) based on the word(s) or phrase(s) you chose in #3. (The idea is to get your corpus split up into useful sub-folders whose names mean something to you.) If build sub-folder is not checked everything goes into the copy to or move to folder.
· And you may have the program add .txt (useful if, as with the BNC World Edition, there are no file extensions) and/or convert it to Unicode.
· You could also have any texts not containing the word Magwich copied to a specified folder.
The load BNC World and load BNC XML buttons are specific to those two editions of the BNC and
</p> <p><span class="badge badge-info text-white mr-2">377</span> read text files with similar names which you will find in your Documents\wsmith6 folder.
See also: Text Converter Contents.

10.7.6 Convert within the text file
Your choices here are 5:
1. cut out a header and/or
2. make one change only
3. insert numbering
4. replace some problem characters
5.
use a script to determine a whole set of changes. There is a sample conversion file to see.
Prepare your Text Converter conversion file using a plain text editor such as Notepad. You could use Documents\wsmith6\convert.txt as a basis.
If you have accented characters in your original files, use the DOS editor to prepare the conversion file if they were originally written under DOS and a Windows editor if they were written
</p> <p><span class="badge badge-info text-white mr-2">378</span> in a Windows word-processor. Some Windows word processors can handle either format.
There can be any number of lines for conversion, and each one can contain two strings, delimited with " " quotes, each of up to 80 characters in length.
The Text Converter makes all changes in order, as specified in the Conversion File. Remember one alteration may well affect subsequent ones.
Alterations that increase the original file
Most changes reduce the size of an original. But Text Converter will cope even if you need to increase the original file, as long as there's disk space!
Tip
To get rid of the <Enter> at line ends but not at paragraph ends, first examine your paragraph ends to see what is unique about them. If for example, paragraphs end with two <Enters>, use the following lines in your conversion file:
"{CHR(13)}{CHR(10)}{CHR(13)}{CHR(10)}" -> "{%%}"
(this line replaces the two <Enters> with {%%}. It could be any other unique combination. It'll be slightly faster if you make the search and the replacement the same length, as in this case, 4 characters)
"{CHR(13)}{CHR(10)}" -> " "
(this line replaces all other <Enters> with a space, to keep words separate)
"{%%}" -> "{CHR(13)}{CHR(10)}{CHR(13)}{CHR(10)}"
(this line replaces the {%%} combination with <Enter><Enter>, thus restoring the original paragraph structure)
/S
(this line cuts out all redundant spaces)
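The logic of the Tip above (protect the paragraph ends with a placeholder, flatten the remaining line ends, then restore the paragraph ends) can be sketched outside WordSmith as follows. This is only an illustration: the function name unwrap_lines is hypothetical, the text is assumed to use carriage-return + line-feed pairs, and the last step imitates only one part of what /S is described as doing (collapsing runs of spaces).

```python
import re

PARAGRAPH_END = "\r\n\r\n"   # two <Enter>s, i.e. CHR(13) CHR(10) twice
LINE_END = "\r\n"            # one <Enter>


def unwrap_lines(text: str, placeholder: str = "{%%}") -> str:
    """Remove line-end <Enter>s while keeping paragraph-end <Enter><Enter>."""
    if placeholder in text:
        # The placeholder must be a combination not already in the text.
        raise ValueError("placeholder already occurs in the text")
    text = text.replace(PARAGRAPH_END, placeholder)  # protect paragraph ends
    text = text.replace(LINE_END, " ")               # flatten other line ends
    text = text.replace(placeholder, PARAGRAPH_END)  # restore paragraph ends
    return re.sub(r" {2,}", " ", text)               # collapse repeated spaces
```

The order matters for the same reason the manual gives for conversion files: each alteration works on the result of the previous one, so the paragraph ends have to be hidden before the single line ends are replaced.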
See also: sample conversion file, syntax, Text Converter Contents.

syntax
The syntax for a Conversion File is:
· Only lines beginning / or " are used. Others are ignored completely.
· Every string for conversion is of the form "A" -> "B". That is, the original string, the one you're searching for, enclosed in double quotes, is followed by a space, a hyphen, the > symbol, and the replacement string.
· You can use " (double quotes) and hyphen where you like without any need to substitute them, but for obvious reasons there must not be a sequence like " -> " in your search or replace string.
Removing all tags
To remove all tags, choose "<*>" -> "" as your search string.
Control Codes
Control codes can be symbolised like this: {CHR(xxx)} where xxx is the number of the code.
Examples: {CHR(13)} is a carriage-return, {CHR(10)} is a line-feed, {CHR(9)} is a tab. To represent <Enter>, which comes at the end of paragraphs and sometimes at the end of each line, you'd type {CHR(13)}{CHR(10)} which is carriage-return followed immediately by line-feed.
</p> <p><span class="badge badge-info text-white mr-2">379</span> Use {CHR(34)} if you need to refer to double inverted commas. See search-word syntax for more.
Wildcards
The search uses the same mechanism that Concord uses. You may use the same wildcards as in Concord search-word syntax. By default the search-and-replace operates on whole words.
Examples:
"book" -> "bk" will replace book with bk but won't replace books or textbook
"*book" -> "bk" will replace book or textbook with bk but won't replace books or textbooks
"book*" -> "bk" will replace book or books with bk but won't replace textbook or textbooks
To show a character is to be taken literally, put it in quotes (e.g. "*","<"). See below for use of the /L parameter.
Unbounded, case Insensitive, Confirm, redundant Spaces, redundant <Enter>s
/C stops to confirm you wish to go ahead before each change.
/U does an unbounded search (ensuring the alteration happens whether there's a word separator on either side or not) (/U "the" finds the, then and other, but also finds bathe).

/I does a case insensitive search (/I "restaurant" -> "hotel" replaces restaurant with hotel, RESTAURANT with HOTEL and Restaurant with Hotel, i.e. respecting case as far as possible).

You can combine these, e.g. /IC "the" -> "this"

/S cuts out all redundant spaces. That is, it will reduce any sequence of two or more spaces to one, and it also removes some common formatting problems such as a lone space after a carriage-return or before punctuation marks such as .,; and ). /S can be used on a line of its own or in combination with other searches.

/E cuts out all redundant <Enter>s. That is, it will reduce any sequence of two or more carriage-return+line-feeds (what you get when you press Enter or Return) to one. /E can be used on a line of its own or in combination with other searches.

/L means both the search and replace strings are to be taken as literal. (Normally a sequence like <#*> would need quotes around each character because < > are mark-up signals and # and * are special wildcard characters, thus "<""#""*"">" which is tricky! Put /L at the start of the line to avoid this.)

See Documents\wsmith6\convert.txt to see examples in use.

See also: Text Converter Contents.

sample conversion file

You could copy all or part of this to the clipboard and paste it into Notepad.

[ comment line -- put whatever you like here, it'll be ignored ]
[ first a spelling correction ]
</p> <p><span class="badge badge-info text-white mr-2">380</span>
"responsable" -> "responsible"
[ now let's change brackets from < > to [ ] and { } to ( ) ]
"*<*" -> "["
"*>*" -> "]"
"*{*" -> "("
"*}*" -> ")"
/S
[ that will clear all redundant spaces ]

The file Documents\wsmith6\convert.txt is a sample conversion file for use with British National Corpus text files.

See also: Text Converter Contents.

10.7.7 Convert format of entire text files

To convert a series of whole text files from one format to another, choose one or more of these options:
</p> <p><span class="badge badge-info text-white mr-2">381</span>
These options allow you to convert into formats which will be suited to text processing.

into Unicode
... this is a better standard than ASCII or ANSI as it allows many more characters to be used, suiting lots of languages. See Flavours of Unicode.
</p> <p><span class="badge badge-info text-white mr-2">382</span>
TXT file extensions
... makes the filename end in .txt (so that Notepad will open without hassling you; Windows was baffled by the empty filenames of the BNC editions prior to the XML edition). If you choose this you will be asked whether to force .txt onto all files regardless, or only ones which have no file extension at all.

curly quotes etc.
... changes any curly single or double quote marks or apostrophes into straight ones, ellipses into three dots, and dashes into hyphens. (Microsoft's curly apostrophes differ from straight ones.)

removing line-breaks
... replaces every end of line line-break with a space. Preserves any true paragraph breaks, which you must ensure are defined (default = <Enter><Enter> -- in other words two line-breaks one after the other with no words between them).
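The "removing line-breaks" option described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WordSmith's own code: the function name remove_line_breaks and its paragraph_break parameter are hypothetical, and the default paragraph marker is taken to be <Enter><Enter> as stated above.

```python
import re


def remove_line_breaks(text: str, paragraph_break: str = "\r\n\r\n") -> str:
    """Replace line-end breaks with a space but keep true paragraph breaks.

    Hypothetical helper: paragraph_break plays the role of the paragraph
    definition the user must supply (default <Enter><Enter>).
    """
    # Treat Windows (CR+LF), old Mac (CR) and Unix (LF) line ends alike.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    marker = paragraph_break.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = normalised.split(marker)
    # Inside a paragraph, every remaining line-break becomes a single space.
    unwrapped = [re.sub(r"[ \t]*\n[ \t]*", " ", p).strip() for p in paragraphs]
    return paragraph_break.join(unwrapped)


if __name__ == "__main__":
    sample = "one\r\ntwo\r\n\r\nthree\r\nfour"
    print(repr(remove_line_breaks(sample)))
```

Splitting on the paragraph marker first and only then unwrapping lines avoids the need for a temporary placeholder such as {%%}, which a conversion file needs because it can only apply one replacement at a time.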
See also: Mark-up, Word/Excel/PDF, non-Unicode, convert within text files, MS Word documents, Guide to handling the BNC
</p> <p><span class="badge badge-info text-white mr-2">383</span>
Mark-up changes

removing all tags
... would convert
The<DT><the> TreeTagger<NP><TreeTagger> is<VBZ>
into
The Treetagger is
Can plough through a copy of the whole BNC, for example, and make it readable. If you have specified a header string it will cut the header up to that point too. Uses the selected span for looking for the next > when it finds a <.

word_TAG to <TAG>word
The Helsinki corpus can come tagged like this (COCOA tags)
the_D occasion_N of_P her_PRO$ father's_N$ death_N
and this conversion procedure will change it to
<D>the <N>occasion <P>of <PRO$>her <N$>father's <N>death
Note: this procedure does not affect underscores within existing <> markup.

word_TAG to word<TAG>
converts text like
</p> <p><span class="badge badge-info text-white mr-2">384</span>
It_PP is_VBZ easy_JJ
or Stanford Log-linear POS tagger output like
It/PP is/VBZ easy/JJ
to
It<PP> is<VBZ> easy<JJ>
You will have to confirm which character such as _ or / divides the word from the tags.
Note: before it starts, it will clear out any existing <> markup.

swap tag and word
converts text like
It<PP> is<VBZ> easy<JJ>
to
<PP>It <VBZ>is <JJ>easy
or vice-versa. In other words, it swaps the order of tags and words. The procedure effects a swap at each space in the non-tagged text sequence. Any tags which do not qualify a neighbouring word but, for example, a whole sentence or a paragraph should not be swapped, so fill in the box to the right with any such tags, using commas to separate them, e.g. <s>,</s>,<p>,</p>
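The two word_TAG conversions described above can be illustrated with a short Python sketch. The function names are hypothetical and the regular expressions are one possible reading of the behaviour described here (existing <> markup left alone in the first case, cleared out in the second), not the Text Converter's actual routine.

```python
import re

MARKUP = re.compile(r"(<[^<>]*>)")  # any existing <...> tag


def word_tag_to_tag_word(text: str, divider: str = "_") -> str:
    """the_D -> <D>the, leaving anything inside existing <> markup untouched."""
    pair = re.compile(r"(\S+)" + re.escape(divider) + r"(\S+)")
    parts = MARKUP.split(text)  # odd-numbered parts are existing tags
    for i in range(0, len(parts), 2):
        parts[i] = pair.sub(lambda m: f"<{m.group(2)}>{m.group(1)}", parts[i])
    return "".join(parts)


def word_tag_to_word_tag(text: str, divider: str = "_") -> str:
    """It_PP or It/PP -> It<PP>, after clearing any existing <> markup."""
    pair = re.compile(r"(\S+)" + re.escape(divider) + r"(\S+)")
    cleared = MARKUP.sub("", text)
    return pair.sub(lambda m: f"{m.group(1)}<{m.group(2)}>", cleared)


if __name__ == "__main__":
    print(word_tag_to_tag_word("the_D occasion_N of_P her_PRO$ father's_N$ death_N"))
    print(word_tag_to_word_tag("It/PP is/VBZ easy/JJ", divider="/"))
```

The divider is passed in explicitly because, as noted above, the user has to confirm which character separates the word from its tag.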
from column tagged
The Stuttgart Tree Tagger produces output like this, separating 3 aspects of each word with a <tab>:

word	pos	lemma
The	DT	the
TreeTagger	NP	TreeTagger
is	VBZ	be
easy	JJ	easy
to	TO	to
use	VB	use
.	SENT	.

You will need to supply a template for your conversion.
Template syntax and examples:
1. Any number in the template refers to the data in that column number. (The is in column 1 above, DT in column 2 of the original.)
2. Only columns mentioned in the template get used in the final output.
3. Separate columns in your template with a / slash.
4. You can add letters and symbols if you like.
5. A space will get added after each line of your original.
Examples:
· the template 1/<3>/<2> will produce with the cases above
The<the><DT> Treetagger<Treetagger><NP> is<be><VBZ> etc.
</p> <p><span class="badge badge-info text-white mr-2">385</span>
· the template <POS="2">/1 will produce
<POS="DT">The <POS="NP">Treetagger <POS="VBZ">is etc.
It will present the text as running text, no longer in columns, but with a break every 80 characters.

entities to characters
... converts HTML or XML symbols which are hard to read such as &eacute; to ones like é. Specify these in a text file. There is a sample file pre-prepared for you, html_entities.txt, in your Documents\wsmith6 folder; look inside and you'll see the syntax.

XML simplification
The idea is to remove any mark-up in XML data which you really do not wish to keep. For example, in the BNC XML edition you might wish to keep only the pos="*" mark-up and remove the c5 and hw attributes. To do so, press the Options button and complete for example like this:

resulting in a saved XML file with a structure like this:
</p> <p><span class="badge badge-info text-white mr-2">386</span>
The procedure simply looks for all sections which begin and end with the required strings and deletes any sections in between which contain the strings you specify in the "remove these" section. No further account of context is taken. Note that the order of attributes is not important, so we could have specified c5="*" first.

See also: Convert Entire Texts

Word, Excel, PDF

from MS Word or Excel to .txt
</p> <p><span class="badge badge-info text-white mr-2">387</span>
This is like using "Save as Text" in Word or Excel. Handles .doc, .docx (Office 2007) and .xls files.

from PDF
... into plain text. Not guaranteed to work with every .PDF as formats have changed and some are complex.
Converting PDFs to plain text can be extremely tricky even if you own a licensed copy of the Adobe software (Adobe themselves created the PDF format in 1993). That is because PDF is a representation of all the dots, colours, shapes and lines in a document, not a long string of words. With only an image of the text, it can be very hard to determine the underlying words and sentences. A second problem is that PDFs can be set with security rights preventing any copying, printing, editing etc. Other formats (.TXT, .DOC, .DOCX, .XML, .HTML, .RTF etc.) are OK in principle as they do not contain only an image but also store within themselves the words and sentences.

See also: Convert Entire Texts

non-Unicode Text
</p> <p><span class="badge badge-info text-white mr-2">388</span>
Codepage conversion
This allows you to convert 1-byte based formats, for example from Chinese Big5 or GB2312, Japanese ShiftJis, Korean Hangul to Unicode.

See also: Convert Entire Texts

Other changes

Unix to Windows
Unix-saved texts don't use the same codes for end-of-paragraph as Windows-saved ones.

encrypting using ...
allows you to encrypt your text files. You supply your own password in the box to the right. When WordSmith processes your text files, e.g. when running a concordance, it will restore the text as needed but otherwise the text will be unintelligible. Encrypted files get the file extension .WSencrypted. For example, if your original is wonderful.txt the copy will be wonderful.WSencrypted. Requires the "safer copy to" button above to be selected.

lemmatising using ...
converts each file using a lemma file. If for example your source text has "she was tired" and your lemma file has BE -> AM, WAS, WERE, IS, ARE, then you will get "she be tired" in your converted text file. Where your source text has "Was she tired?" you'll get "Be she tired?"
</p> <p><span class="badge badge-info text-white mr-2">389</span>
SRT Transcripts
converts SRT files such as those obtained from TED Open Translation Project. If using TED files you may need to add some seconds for the standard TED lead-in.

Example
These text files in English (.en), Spanish (.es), Italian (.it) and Japanese (.ja) originally downloaded

got converted thus:

To enable Concord to play the .mp4 file I had to change EloraHardy_2015-480p.mp4 to the same title, Elora Hardy Magical houses made of bamboo.mp4. Note the file sizes are bigger (converted into Unicode) and the file-names no longer have two dots. This is so that Concord will find a match between its file-name and the transcripts in these 4 languages.

See also: Convert Entire Texts

10.7.8 Text Converter: converting BNC XML version

The British National Corpus is a valuable resource but has certain problems as it comes straight off the cdrom:
· it is in Unix format
· it has entities like &eacute; to represent characters like é
· its structure is opaque and file-names mean nothing
You will find it much easier to use if you
· convert it to Unicode
· filter the files to make a useful structure
</p> <p><span class="badge badge-info text-white mr-2">390</span>
as explained at http://lexically.net/wordsmith/Handling_BNC/index.html

The easiest way to do that is in two stages.

Conversion:
After choosing the texts,
</p> <p><span class="badge badge-info text-white mr-2">391</span>
and when you press OK you'll be asked something like this
</p> <p><span class="badge badge-info text-white mr-2">392</span>
After the work is done you will see the BNC texts copied to a similar structure (in our case stemming from j:\temp)

Filter
Choose the converted texts in the first window: de-activate conversion,
</p> <p><span class="badge badge-info text-white mr-2">393</span>
and choose filtering like this:

Eventually you should get folder structures like this:
</p> <p><span class="badge badge-info text-white mr-2">394</span>
10.8 Viewer and Aligner

10.8.1 purpose

This is a program for showing your text or other files, highlighting words of interest. You will see them in plain text format, with tag mark-up shown or hidden as in your tag settings. There are a number of settings and options you can change.

Its main use is to produce an aligned version of 2 or more texts, with alternate sentences or paragraphs from each of them.
</p> <p><span class="badge badge-info text-white mr-2">395</span>
See also: Viewer & Aligner settings, Viewer & Aligner options, an example of aligning

10.8.2 index

Explanations
What is the Viewer & Aligner and what's it for?
an example of aligning
Settings
Viewing Options
What to do if it doesn't do what I want...
Searching for Short Sentences
Joining/Splitting
Aligning a Dual Text
Finding translation mis-matches
The technical side...

see also: WordSmith Main Index
</p> <p><span class="badge badge-info text-white mr-2">396</span>
10.8.3 aligning with Viewer & Aligner

This feature aligns the sentences in two files. Translators need to study differences between an original and a translation. Other linguists might want it to study differences between two versions of a text in the same language. Students of different languages can use it as they might use dual language readings, to study closely the differences e.g. in word order.

It helps you produce a new text which consists of the two files, with sentences interspersed. That way you can compare the translation with the original.

Example

Original: Der Knabe sagte diesen Gedanken dem Schwesterchen, und diese folgte. Allein auch der Weg auf den Hals hinab war nicht zu finden. So klar die Sonne schien, ... (from Stifter's Bergkristall, translated by Harry Steinhauer, in German Stories, Bantam Books 1961)

Translation: The boy communicated this thought to his sister and she followed him. But the road down the neck could not be found either. Though the sun shone clearly, ...

Aligned text:
<G1> Der Knabe sagte diesen Gedanken dem Schwesterchen, und diese folgte.
<E1> The boy communicated this thought to his sister and she followed him.
<G2> Allein auch der Weg auf den Hals hinab war nicht zu finden.
<E2> But the road down the neck could not be found either.
<G3> So klar die Sonne schien, ...
<E3> Though the sun shone clearly, ...

An aligned text like this helps you identify additions and omissions, normalisations, style changes, word order preferences. In this case the translator has chosen to avoid very close equivalence.
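The interspersing shown in the aligned text above can be sketched in Python as follows. This is a minimal illustration which assumes the two texts have already been split into sentence lists; the function name interleave and the tag letters are hypothetical, and it does nothing to correct slippage where one sentence was translated as two.

```python
from itertools import zip_longest


def interleave(originals, translations, tags=("G", "E")):
    """Pair sentence n of each version and label them <G1>, <E1>, <G2>, ..."""
    lines = []
    pairs = zip_longest(originals, translations, fillvalue="")
    for n, (source, target) in enumerate(pairs, start=1):
        # rstrip() removes the trailing space left when one version is shorter.
        lines.append(f"<{tags[0]}{n}> {source}".rstrip())
        lines.append(f"<{tags[1]}{n}> {target}".rstrip())
    return lines


if __name__ == "__main__":
    german = [
        "Der Knabe sagte diesen Gedanken dem Schwesterchen, und diese folgte.",
        "Allein auch der Weg auf den Hals hinab war nicht zu finden.",
    ]
    english = [
        "The boy communicated this thought to his sister and she followed him.",
        "But the road down the neck could not be found either.",
    ]
    print("\n".join(interleave(german, english)))
```

Because the pairing is purely by position, any extra or missing sentence in one version shifts every later pair, which is why joining and splitting by hand are usually needed before saving.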
See also: an example of aligning, Aligning and moving

10.8.4 example of aligning

How to do it -- a Portuguese and English example
1. Read in your Portuguese text (e.g. Hora da Estrela.TXT), checking that its sentences and paragraphs break the way you like. Try "Unusual Lines" to help identify oddities.
2. Save it
</p> <p><span class="badge badge-info text-white mr-2">397</span>
and it will (by default) get your filename .VWR, e.g. Hora da Estrela.VWR. (It is important to do that, as a .VWR file knows the language, colour settings etc. and the cleaning up work you've done, whereas the .TXT file is just the original text file you read in.)
3. Do the same steps 1 and 2 for your English text -- you will now have e.g. Hour of the Star.VWR.
4. You could if desired repeat with the Spanish -- Hora de la Estrella.txt giving Hora de la Estrella.VWR (or German, Russian, Arabic, etc.).
5. Now open your Portuguese Hora da Estrela.VWR
6. and then File | Merge
</p> <p><span class="badge badge-info text-white mr-2">398</span>
</p> <p><span class="badge badge-info text-white mr-2">399</span>
7. Finally File | Save choosing Aligned files (.ALI) as the format.

10.8.5 aligning and moving

You may well want to alter sentence ordering. The translator may have used three sentences where the original had only one. You can also merge paragraphs.
</p> <p><span class="badge badge-info text-white mr-2">400</span>
adjusting by dragging with the mouse
To merge sentences or paragraphs, simply grab one and drag it up to the next one above in the same language. Or use the Join button. Or press F4.
To split a sentence or paragraph, choose the Split button or press Ctrl+F4.
Finally you will want to save (Ctrl+F2) the results.
See also: Viewer & Aligner contents

10.8.6 editing

While Viewer & Aligner is not a full word-processor, some editing facilities have been built in to help deal with common formatting problems:
· Split: allows you to choose where a line should be divided in two.
· Join down, Join up: these buttons merge a line with another one. You can achieve this also by simply dragging.
· Cut line: removes any blank lines.
· Trim: this goes through each sentence of the text, removing any redundant spaces -- where there are two or more consecutive spaces they will be reduced to one.
· Cut & Trim All does these actions on the whole text.
· Edit opens up a window allowing you to edit the whole of the current sentence or paragraph.
· Heading: allows you to treat a line as a heading, and if so makes it look bold.
· Find unusual lines: this identifies cases where a sentence or paragraph does not start with a capital letter or number -- you will probably want to join it to the one above, or where a line is unusually short, etc.
· Find short lines
You will then want to save (Ctrl+F2) your text.

You can also:
· open a new file for viewing (you can open any number of text files within Viewer & Aligner)
· copy a text file to the clipboard (select, then press Control+Ins)
· print the whole or part of the currently active text file
· search for words or phrases (press F12)

10.8.7 languages

Each Viewer file (.VWR) has its own language. Each Aligner file (.ALI) has one language for each of the component sections. (They could all be the same: if, for example, you were analysing various different editions of a Shakespeare play they'd all be English.) The set of languages available is that defined using the Languages Chooser.

You can change the language to one of your previously defined languages using the drop-down list. Here is an example where a Portuguese language plain TXT text file was opened and the default language was English.
</p> <p><span class="badge badge-info text-white mr-2">401</span>
When Portuguese was chosen in the drop-down list, and agreed to, it was possible to save the result (as a .VWR file) so that henceforth it would know which language to use.

10.8.8 numbering sentences & paragraphs

You can use the Viewer & Aligner to make a copy of your text with all the sentences and/or paragraphs tagged with <S> and <P>.
To do this, simply read the text file in, choose Edit | Insert Tags, then save it as a text file.

See also: Viewer & Aligner contents
</p> <p><span class="badge badge-info text-white mr-2">402</span>
10.8.9 options

Mode: Sentence/Paragraph
This switches between Sentence mode and Paragraph mode. In other words you can choose to view your text files with each row of the display taking up a sentence or a paragraph. Likewise, you can make a dual aligned text by interspersing either paragraphs or sentences. The other functions (e.g. joining, splitting) work in the same way in either mode.

Colours
The various texts in your aligned text will have different colours associated with them. Colours can be changed using the button.

10.8.10 reading in a plain text

In Viewer and Aligner, choose File | Open, and select your plain text file.
You may see this sort of thing in Sentence view,
</p> <p><span class="badge badge-info text-white mr-2">403</span>
or in Paragraph view

Edit it, as necessary, e.g. splitting or merging paragraphs or sentences. There's a taskbar with buttons to help above the text.
Ensure the language is right:
</p> <p><span class="badge badge-info text-white mr-2">404</span>
And save it as a .VWR file.
See also: example of aligning

10.8.11 joining and splitting

Joining
The easiest way to join two sentences is simply to drag the one you want to move onto its neighbour above. Or select the lower of the two and press F4 or use the Join button.
In this example, sentence 60 in Portuguese got represented as two sentences, 60 and 61, in English.

Splitting in two
To split a sentence, press the Split button. You will get a list of the words. Click on the word which should end the sentence, then press OK.

example
</p> <p><span class="badge badge-info text-white mr-2">405</span>
This will insert the words which follow (I need others etc.) into a new line below.

See also: Viewer & Aligner contents

10.8.12 settings

1. What constitutes a "short" sentence or paragraph (default: less than 25 characters)
2. Whether you want to do a lower-case check when Finding Unusual Lines

The settings are standard ones found in most of the Tools:
Colours
Font
Printing
Text Characteristics
Review all Settings

10.8.13 technical aspects

When is a sentence not a sentence?
There is no perfect mechanical way of determining sentence-breaks. For example, a heading may well have no final full stop but would normally not be considered part of the sentence which follows it. And a sentence may often have no final full stop, if what follows it is a list of items.

The algorithm used by Viewer & Aligner is: a sentence ends when it meets the requirements
</p> <p><span class="badge badge-info text-white mr-2">406</span>
explained in the definition of a sentence. The same routine is used as in WordList.

Consider this chunk from A Tale of Two Cities:
"Wo-ho!" said the coachman. "So, then! One more pull and you're at the top and be damned to you, for I have had trouble enough to get you to it! - Joe!"
Viewer & Aligner will mistakenly consider - Joe! as a separate sentence, but handles "Wo-ho!"
said the coachman. as one: though the program would split it in two if the word after ho! had a capital letter (e.g. in Wild Bill, the coachman, said.)

Viewer & Aligner cannot therefore be expected to handle all sentence boundaries exactly as you would. (I saw Mr. Smith. would be considered two sentences; several headings may be bundled together as one sentence.) For this reason you can choose Find Short Sentences to seek out any odd one-word sentences.

See also: Viewer & Aligner contents

10.8.14 translation mis-matches

Viewer & Aligner can help find cases where alignment has slipped (one sentence having been translated as two or three). One method is to use the menu item Match by Capitals. This searches for matching proper nouns in the two versions: if, say, Paris is mentioned in sentence 25 of the source text and not in sentence 25 of the translation but in sentence 27, it is very likely that some slippage has occurred.

Viewer & Aligner will search forwards from the current text sentence on, and will tell you where there's a mis-match. You should then search back from that point to find where the sentences start to diverge. It may be useful to sample every 10 or every 20 to speed up the search for slippage.

When you find the problem, un-join or join and/or edit the text as appropriate, then save it.

See also: Viewer & Aligner contents, The technical side..., Finding unusual sentences

10.8.15 troubleshooting

Can't see the whole sentence or paragraph
Press the button to "auto-size" the lines in your display. This adjusts line heights according to the current highlighted column of data.

Can't see the whole text file
Press the button to "refresh" the display.

Don't like the colours
</p> <p><span class="badge badge-info text-white mr-2">407</span>
Change colours using the button.
The colours initially used for each language version in the dual-language window are the same colours as used for primary sorting and secondary sorting in Concord.

See also: Viewer & Aligner contents

10.8.16 unusual lines

It can be useful to seek unusually short sentences to see whether your originals have been handled as you want. Because Viewer & Aligner uses full stops, question marks and exclamation marks as sentence-boundary indicators, you will find a string like "Hello! Paul! Come here!" is broken into 3 very short sentences. Depending on your purposes you may wish to consider these as one sentence, e.g. if a translator has translated them as one ("Oi, Paulo, venha cá!").

This function can also find lower-case lines: where a sentence or paragraph does not start with a capital letter or number -- you will probably want to join it to the one above. This problem is common if the text has been saved as "text only with line breaks" (where an <Enter> comes at the end of each line whether or not it is the end of a paragraph).

Seeking
Use the Find Unusual toolbar menu item and then press Start Search. Viewer & Aligner will go to the next possibly problematic sentence or paragraph and you will probably want to join it by pressing Join Up (to the one above), Join Down, or Skip.
</p> <p><span class="badge badge-info text-white mr-2">408</span>
"Case check" switches on or off the search for lower-case sentence starts. The number (25 in the example above) is for you to determine the number of characters counting as a short sentence or paragraph.

See also: Settings, The technical side..., Finding translation mis-matches, Viewer & Aligner contents

10.9 WSConcGram

10.9.1 aims

A program for finding concgrams: essentially related pairs, triplets, quadruplets (etc.) of words.
</p> <p><span class="badge badge-info text-white mr-2">409</span>
See also: definition of concgram, settings, running WSConcGram, filtering, viewing the output.

10.9.2 definition of a concgram

For years it has been easy to search for or identify consecutive clusters (n-grams) such as AT THE END OF, MERRY CHRISTMAS or TERM TIME. It has also been possible to find non-consecutive linkages such as STRONG within the horizons of TEA by adapting searches to find context words.

The concgram procedure takes a whole corpus of text and finds all sorts of combinations like the ones above, whether consecutive or not.

Cheng, Greaves & Warren (2006:414) define a concgram like this:

For our purposes, a ‘concgram’ is all of the permutations of constituency variation and positional variation generated by the association of two or more words. This means that the associated words comprising a particular concgram may be the source of a number of ‘collocational patterns’ (Sinclair 2004:xxvii). In fact, the hunt for what we term ‘concgrams’ has a fairly long history dating back to the 1980s (Sinclair 2005, personal communication) when the Cobuild team at the University of Birmingham led by Professor John Sinclair attempted, with limited success, to devise the means to automatically search for non-contiguous sequences of associated words.

Essentially what they were seeking in developing the ConcGram (©) program was "a search-engine, which on top of the capability to handle constituency variation (i.e. AB, ACB), also handles positional variation (i.e. AB, BA), conducts fully automated searches, and searches for word associations of any size." (2006:413)

WSConcGram is developed in homage to this idea.
</p> <p><span class="badge badge-info text-white mr-2">410</span>
See also: bibliography, settings, running WSConcGram, viewing the output, filtering.
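To make the two kinds of variation in the definition concrete, here is a small Python sketch that counts word pairs co-occurring within a span, regardless of order (positional variation, AB or BA) and of intervening words (constituency variation, AB or ACB). It is an illustration only and not the algorithm used by WSConcGram or by the ConcGram program; the function name concgram_pairs and the span and min_freq parameters are assumptions made for the example.

```python
import re
from collections import Counter


def concgram_pairs(text: str, span: int = 4, min_freq: int = 2) -> dict:
    """Count unordered word pairs found within `span` words of each other."""
    words = re.findall(r"\w+", text.upper())
    counts = Counter()
    for i, word in enumerate(words):
        # Look only to the right; the frozenset makes AB and BA the same pair.
        for other in words[i + 1 : i + 1 + span]:
            if other != word:
                counts[frozenset((word, other))] += 1
    # Keep only pairs which co-occur at least min_freq times.
    return {tuple(sorted(pair)): n for pair, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    print(concgram_pairs("strong tea and tea so strong", span=2))
```

In the sample sentence STRONG and TEA are counted twice: once adjacent in one order and once in the other order with SO between them. Triplets and larger groups, and any statistical test of association, are left out of this sketch.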
10.9.3 settings

The settings are found in the main Controller.

10.9.4 generating concgrams

To start, as usual, choose File | New.
In the Getting Started window, first choose an existing Index, as here where an index based on the works of Dickens has been selected.
</p> <p><span class="badge badge-info text-white mr-2">411</span>
To generate the concgrams, the program will then need to build some further files based on the existing index files:

There are two steps simply because there's a lot of work if the original index is large. You can stop after the first stage and resume the next day if you wish. With a modern PC and a source text corpus of only a few million words, though, it should be possible to generate the files in a matter of a few minutes.

As you see above, some large additional files have been generated at the end of the two Build steps marked on the buttons in the top window. All items which are found together at least as often as set in the Index settings (here 5 times) will be saved as potential members of each concgram.

Now, choose Show to view the results. (Or, as usual, right-click the main WSConcgram window and choose last file).
</p> <p><span class="badge badge-info text-white mr-2">412</span>
10.9.5 viewing concgrams

When you first open a concgram file created by WSConcGram, it will look something like this one

It'll appear (by default) in frequency order as set in the settings but you can sort it by pressing the Word and Freq headers, and can search for items using the little box above the list.
</p> <p><span class="badge badge-info text-white mr-2">413</span>
To get a detailed set of concgrams, double-click an item such as PIP (the hero of Great Expectations), or drag it to the list-box above. Then press the concgram button beside that.
You then get a tree view like this

where similar items are grouped. Each branch of the tree shows how many sub-items and how many items of its own it has.

The other controls are used for suspending lengthy processing, changing from a tree-view to a list, for concordancing, for filtering, clearing filters, and showing more or less of the tree.

So if you prefer a plain list, click as Tree to view like this:
</p> <p><span class="badge badge-info text-white mr-2">414</span>
You may if you like select several items like this:
</p> <p><span class="badge badge-info text-white mr-2">415</span>
but do note that the concgrams will have to contain all of the words selected.

After filtering appropriately and pressing the Concordance button
</p> <p><span class="badge badge-info text-white mr-2">416</span>
If you right-click and choose Show Details you'll get to see the details of any section of the tree you have selected:
</p> <p><span class="badge badge-info text-white mr-2">417</span>
where you see the various forms and the filename(s) they came from.

10.9.6 filtering concgrams

In order to select which items are "associated", we need some sort of suitable statistical procedures. The members of each concgram are at present merely associated by co-occurring at least a certain number of times as explained in generating them.

The Filtering settings in the Controller allow you to specify, for example, that you want to see only those which are associated with a MI (mutual information) score of 2.0 or a Log Likelihood score of 3.0.
</p> <p><span class="badge badge-info text-white mr-2">418</span>
Ensure the statistics you need are checked and set to suitable thresholds, and decide whether all the thresholds have to be met (in the case above both MI and Log Likelihood would have to score 3.0 at least) or any of them (in the case above MI at 3.0 or above or Log Likelihood at 3.0 or above). You can also optionally insist on certain words being in your filtered results.
When you press the filter button, you will see something like this:</p> <p><span class="badge badge-info text-white mr-2">419</span>
where the items which meet the filter requirements are separated out and selected ready for concordancing; any others are hidden. To the right you see that the head-word CAESAR here relates to AND, HE, HER, I, ANTONY etc. above the thresholds set.

10.9.7 exporting concgrams
With concgram data loaded, you may wish to export it to a plain text file which can be imported into Excel or imported into a WordSmith word-list. Choose Compute | WordList and you will be offered choices like these. The suggested filename is based on your concgram data.</p> <p><span class="badge badge-info text-white mr-2">420</span>
10.10 Character Profiler
10.10.1 purpose
The point of it...
Character Profiler is a tool to help find out which characters are most frequent in a text or a set of texts. The purpose could be to check out which characters are most frequent (e.g. in normal English text the letter E followed by T will be most frequent as shown below), or it could be to check whether your text collection contains any oddities, such as accented characters or curly apostrophes you weren't expecting. The first 32 codes used in computer storage of text are "control characters" such as tabs, line-feeds and carriage-returns.
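A minimal sketch of this kind of character profile, written in Python purely for illustration, is given here. It is not how Character Profiler itself is implemented: the function names are invented, the type labels come from Python's unicodedata categories (an assumption, not WordSmith's own classification), and treating tab, line-feed and carriage-return as the only expected control characters is likewise an assumption.

```python
import unicodedata
from collections import Counter

# Control characters that a plain text file may be expected to contain
# (an assumption made for this sketch).
EXPECTED_CONTROLS = {"\t", "\n", "\r"}

def char_type(ch):
    """Classify a character roughly as letter, digit, punctuation, control or other."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category.startswith("N"):
        return "digit"
    if category.startswith("P"):
        return "punctuation"
    if category == "Cc":
        return "control"
    return "other"

def profile(text):
    """Return rows of (code, character, type, freq, %) sorted by frequency."""
    counts = Counter(text)
    total = len(text)
    return [
        (f"U+{ord(ch):04X}", ch, char_type(ch), n, 100 * n / total)
        for ch, n in counts.most_common()
    ]

def unexpected_controls(text):
    """List control characters other than tab, line-feed and carriage-return."""
    return sorted(
        {ch for ch in text
         if unicodedata.category(ch) == "Cc" and ch not in EXPECTED_CONTROLS}
    )

sample = "See the tree\x0c"   # ends with a page-eject code (12)
rows = profile(sample)
oddities = unexpected_controls(sample)
```

Here rows lists each character with its code, type, frequency and percentage, and oddities flags the page-eject code, the sort of unexpected character that could suggest the file did not start life as plain text.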
A plain .txt version of a text should only contain letters, numbers, punctuation and tabs, line-feeds and carriage-returns -- if there are other symbols you do not recognise you may have a .txt file which is really an old WordPerfect or Word .doc in disguise.
Character Profiler would also enable you to discover the most used characters across languages, as in this screenshot:
For further details see http://lexically.net/downloads/corpus_linguistics/1984_characters.xls .

10.10.2 profiling characters
How to do it
1. Choose one or more texts or a folder. You can type in a complete filename (including drive and folder), and can use wildcards such as *.txt, or you can browse to find your text or folder.
2. If you want to study one text only, just choose one text, but you may choose a whole folderful or more by using the "sub-folders too" option.
3. Press Analyse.</p> <p><span class="badge badge-info text-white mr-2">421</span>
The display shows details of your selected text, and if you click the Source Text tab you can see the original text. (If you have analysed a whole set of text files the Source Text tab will show only the last one.)
Legend
code: the Unicode code of the character
character: the character
type: distinguishing punctuation, digits, letters
%: percentage of the total number of characters in the text(s)
freq.: number of occurrences of that character
<Tab> etc.: control characters, indicated in red
1st Position: number of each letter-character occurring in word-initial position
2nd: number found in second position in any word, etc.
Note that 8th will only be able to count letter frequencies for words at least 8 letters long, while 1st or 2nd will handle nearly all words.</p> <p><span class="badge badge-info text-white mr-2">422</span>
Sort
Click the header to sort the data:
The letter E (upper and lower case merged) here represents nearly ten percent of all letters, closely followed by T.
If sorted by 1st position in the word, however, the letter E comes after T, A, I, S, O, W, C, B, P, H, M and F in frequency. Presumably the ranking of T reflects the frequency in English of the and A of a.</p> <p><span class="badge badge-info text-white mr-2">423</span>
Copy
Copies the data to the clipboard, ready to be pasted for example into Excel.
See also: settings

10.10.3 profiling settings
The top two boxes allow you to choose a font for your display. Most fonts can only represent some of the Unicode characters, so you may need to experiment to determine which is best for your language. (Character Profiler translates any text into Unicode whether or not it is in Unicode originally, and tells you which form it is in on the Results tab.)
Header to cut
If you've typed something in here such as </Header> , the program treats all the text before that as a header to be excluded from analysis.
Copy letter characters only
Check this one to force the copying to the clipboard to copy only data of letters, ignoring punctuation and digits.
Merge lower and UPPER case
Check this one to convert all text to upper case.</p> <p><span class="badge badge-info text-white mr-2">424</span>
10.11 Chargrams
10.11.1 purpose
The point of it...
Chargrams (sequences of N characters) is a tool to help find out which chargrams are most frequent in a text or a set of texts. The purpose could be to check out which chargrams are most frequent e.g. in word-initial position, in the middle of a word, or at the end.
These are 3-letter chargrams occurring in word-final position. ING is a well-known ending in English; HAT is a frequent 3-letter sequence at the end of words too.
How does it work?
Chargrams are computed by taking only the valid characters of text.
If a text contained "In 1845 there was a princess", the 3-character chargrams considered would be THE, HER, ERE, WAS, PRI, RIN, INC, NCE, CES, ESS. The positions are computed in relation to the original words, so THE is word-initial while ESS is word-final, and RIN is medial.
If including punctuation, the sequences would include IN_, N_1, _18, 184 etc. too.

10.11.2 chargram procedure
How to do it
Choose File | New in the Chargrams menu. Choose your texts as with the other Tools.</p> <p><span class="badge badge-info text-white mr-2">425</span>
Then press the button to make a chargram list.
See also: settings.

10.11.3 display
The display is similar to that in the WordList tool.
Contexts
This column shows the word-contexts for each chargram. You can double-click to see the whole list.</p> <p><span class="badge badge-info text-white mr-2">426</span>
Sorting
You can sort by clicking a header. This offers you two of the columns to sort on, a primary sort and then, where values on the primary sort are the same, a secondary sort. In this example the user is choosing the number of texts in descending order and then word-position in ascending order. This gave the following list (extract):</p> <p><span class="badge badge-info text-white mr-2">427</span>
Word-initial chargrams 1-15 and mid-word chargrams 16 onwards all occurred in 100% of the texts selected.
Concordancing
As in many other Tools, you can concordance selected items by choosing Compute | Concordance in the menu. In the case above I wondered about the context word cou</p> <p><span class="badge badge-info text-white mr-2">428</span>
... which clearly shows speech reformulation.
See also: settings.

10.11.4 settings
Settings are found in the main Controller.
You can set minimum and maximum token frequencies for the chargrams to be included in the results, a minimum and maximum number of texts they must appear in, and the length in characters (e.g. 3 to 4 characters).
Context words
As the chargrams are selected, note is taken of the word which they are found in. Here you can determine a minimum number of times for each chargram to appear in a given word for that word to be listed, and a maximum number of context words per chargram to be collected. (Storing lots of extra words will use up system memory so a default of 20 or 50 may be reasonable.)
Word position
By default chargrams in all three positions (word-initial, word-medial and word-final) will be collected. If you check the ignore word position box, word positions get merged.
Include punctuation</p> <p><span class="badge badge-info text-white mr-2">429</span>
Allows chargrams of all characters (symbols, punctuation etc.) to be included. Spaces get replaced by underscores.
Include digits
Allows chargrams of digits as well as alphabetic characters.
Ignore low-frequency context chargrams
This setting allows us to filter out any chargrams which do not occur in many contexts. As shown here, any chargrams not occurring in at least 4 context word types will get eliminated. If, for example, a chargram has been found in 5 context word-types then the chargram is included in your list. But if in only 1 of these it is found occurring at least 4 times (i.e. found in the same context word-type recurring in the texts at least 4 times), you will see one context word only in the contexts column.
See also: chargrams display</p> <p><span class="badge badge-info text-white mr-2">430</span>
Reference
Section XI</p> <p><span class="badge badge-info text-white mr-2">431</span>
11 Reference
11.1 acknowledgements
WordSmith Tools has developed over a period of years.
Originally each tool came about because I wanted a tool for a particular job in my work as an Applied Linguist. Early versions were written for DOS, then Windows™ came onto the scene.
One tool, Concord, had a slightly different history. It developed out of MicroConcord which Tim Johns and I wrote for DOS and which Oxford University Press published in 1993.
The first published version was written in Borland Pascal™ with the time-critical sections in Assembler. Subsequently the programs were converted to Delphi™ 16-bit; this is a 32-bit only version written in Delphi XE and still using time-critical sections in Assembler.
I am grateful to
· lots of users who have made suggestions and given bug reports, for their feedback on aspects of the suite (including bugs!), and suggestions as to features it should have.
· generations of students and colleagues at the School of English, University of Liverpool, the MA Programme in Applied Linguistics at the Catholic University of São Paulo, colleagues and students at Aston University.
· Audrey Spina, Élodie Guthmann and Julia Hotter for their help with the French & German versions of WS 4.0; Spela Vintar's student for Slovenian; Zhu Yi and others at SFLEP in Shanghai for Mandarin for WS 5.0.
· Robert Jedrzejczyk (http://prog.olsztyn.pl/paslibvlc) for his PasLibVCLPlayer which enables WordSmith to play video.
Researchers from many other countries have also acted as alpha-testers and beta-testers and I thank them for their patience and feedback. I am also grateful to Nell Scott and other members of my family who have always given valuable support, feedback and suggestions.
Mike Scott
Feel free to email me at my contact address with any further ideas for developing WordSmith Tools.

11.2 API
It is possible to run the WordSmith routines from your own programs; for this there's an API published. If you know a programming language, you can call a .dll
which comes with WordSmith and ask it to create a concordance, a word-list or a key words list, which you can then process to suit your own purposes.
Easier, however, is to write a very simple batch script which will run WordSmith unattended.
See also: custom processing</p> <p><span class="badge badge-info text-white mr-2">432</span>
11.3 bibliography
Aston, Guy, 1995, "Corpora in Language Pedagogy: matching theory and practice", in G. Cook & B. Seidlhofer (eds.) Principle & Practice in Applied Linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press, 257-70.
Aston, Guy & Burnard, Lou, 1998, The BNC Handbook, Edinburgh: Edinburgh University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan, 2000, Longman Grammar of Spoken and Written English, Harlow: Addison Wesley Longman.
Clear, Jeremy, 1993, "From Firth Principles: computational tools for the study of collocation" in M. Baker, G. Francis & E. Tognini-Bonelli (eds.), 1993, Text and Technology: in honour of John Sinclair, Philadelphia: John Benjamins, 271-92.
Cheng, Winnie, Chris Greaves & Martin Warren, 2006, From n-gram to skipgram to concgram. International Journal of Corpus Linguistics, Vol. 11, No. 4, pp. 411-433.
Dunning, Ted, 1993, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, Vol 19, No. 1, pp. 61-74.
Fillmore, Charles J, & Atkins, B.T.S, 1994, "Starting where the Dictionaries Stop: The Challenge of Corpus Lexicography", in B.T.S. Atkins & A. Zampolli, Computational Approaches to the Lexicon, Oxford: Clarendon Press, pp. 349-96.
Katz, Slava, 1996, Distribution of Common Words and Phrases in Text and Language Modelling, Natural Language Engineering 2 (1), 15-59.
Murison-Bowie, Simon, 1993, MicroConcord Manual: an introduction to the practices and principles of concordancing in language teaching, Oxford: Oxford University Press.
Nakamura, Junsaku, 1993, "Statistical Methods and Large Corpora: a new tool for describing text types" in M. Baker, G. Francis & E. Tognini-Bonelli (eds.), 1993, Text and Technology: in honour of John Sinclair, Philadelphia: John Benjamins, 293-312.
Oakes, Michael P. 1998, Statistics for Corpus Linguistics, Edinburgh: Edinburgh University Press.
Scott, Mike, 1997, "PC Analysis of Key Words - and Key Key Words", System, Vol. 25, No. 2, pp. 233-45.
Scott, Mike & Chris Tribble, 2006, Textual Patterns: keyword and corpus analysis in language education, Amsterdam: Benjamins.
Sinclair, John M, 1991, Corpus, Concordance, Collocation, Oxford: Oxford University Press.
Stubbs, Michael, 1986, "Lexical Density: A Technique and Some Findings", in M. Coulthard (ed.) Talking About Text: Studies presented to David Brazil on his retirement, Discourse Analysis Monograph no. 13, Birmingham: English Language Research, Univ. of Birmingham, 27-42.
Stubbs, Michael, 1995, "Corpus Evidence for Norms of Lexical Collocation", in G. Cook & B. Seidlhofer (eds.) Principle & Practice in Applied Linguistics: Studies in honour of H.G. Widdowson, Oxford: Oxford University Press, 245-56.
Tuldava, J. 1995, Methods in Quantitative Linguistics, Trier: WVT Wissenschaftlicher Verlag Trier.
Youmans, Gilbert, 1991, "A New Tool for Discourse Analysis: the vocabulary-management profile", Language, V. 67, No. 4, pp. 763-89.
UCREL's log likelihood information

11.4 bugs
All computer programs contain bugs. You may have seen a "General Protection Fault" message when using big expensive drawing or word-processing packages.</p> <p><span class="badge badge-info text-white mr-2">433</span>
If you see something like this, then you have an incompatibility between sections of WordSmith. You have probably downloaded a fresh version of some parts of WordSmith but not all, and the various sub-programs are in conflict... The solution is a fresh download.
http://lexically.net/wordsmith/version6/faqs/updating_or_reinstalling.htm explains.
Otherwise you should get a report popping up, giving "General" information about your PC and "Details" about the fault. This information will help me to fix the problem and will be saved in a small text file called wordsmith.elf, concord.elf, wordlist.elf, etc. When you quit the program, you will be offered a chance to email this to me.
The first thing you'll see when one of these happens is something like this:
You may have to quit when you have pressed OK, or WordSmith may be able to cope despite the problem.
Usually the offending program will be able to cope despite the bug or you can go straight back into it without even needing to quit the main WordSmith Tools Controller, retrieve your saved results from disk, and resume. If that doesn't work, try quitting WordSmith Tools overall, or quit Windows and then start it up again.
When you press OK, your email program should have a message with a couple of attachments to send to me. The email message will only get sent when you press Send in your email program. It is only sent to me and I will not pass it on to anyone else. Read it first if you are worried about revealing your innermost secrets ... it will tell me the operating system, the amount of RAM and hard disk space, the version of WordSmith, and some technical details of routines which it was going through when the crash occurred.</p> <p><span class="badge badge-info text-white mr-2">434</span>
error messages
These warn you about problems which occur as the program works, e.g. if there's no room left on your disk, or you type in an impossible file name or a number containing a comma.
See also: logging, troubleshooting.

11.5 change language
If you have results computed with the wrong language setting, that can affect things, e.g. a key word listing depends on finding the words in the right order.
To redefine the language of your data, choose Edit | Change Language, and in the resulting window press Change once you have chosen a suitable alternative. If you choose a different one from the list of Alternatives, your Language and Text settings in the main Controller will change too. In this screenshot, pressing Change will change the language to Polish.

11.6 Character Sets
11.6.1 overview
You need "plain text" in WordSmith. Not Microsoft Word .doc files -- which contain text and a whole lot of other things too that you cannot normally see.
If you are processing English only, your texts can be in ASCII, ANSI or Unicode; WordSmith handles all of these formats. If in other languages, read on...
To handle a text in a computer, programs need to know how the text is encoded. In its processing, the software sees only a long string of numbers, and these have to match up with what you and I can recognise as "characters". For many languages like English with a restricted alphabet, encoding can be managed with only 1 "byte" per character. On the other hand a language like Chinese, which draws upon a very large array of characters, cannot easily be fitted to a 1-byte system. Hence the creation of other "multi-byte" systems. Obviously if a text in English is encoded in a multi-byte way, it will make a bigger file than one encoded with 1 byte per character, and this is slightly wasteful of disk and memory space. So, at the time of writing, 1-byte character sets are still in very widespread use. UTF-8 is a name for a multi-byte method, widely used for Chinese, etc.</p> <p><span class="badge badge-info text-white mr-2">435</span>
In practice, your texts are likely to be encoded in a Windows 1-byte system, older texts in a DOS 1-byte system, and newer ones, especially in Chinese, Japanese, Greek, in Unicode.
What matters most to you is what each character looks like, but WordSmith cannot possibly sort words correctly, or even recognise where a word begins and ends, if the encoding is not correct. WordSmith has to know (or try to find out) which system your texts are encoded in. It can perform certain tests in the background. But as it doesn't actually understand the words it sees, it is much safer for you to convert to Unicode, especially if you process texts in German, Spanish, Russian, Greek, Polish, Japanese, Farsi, Arabic etc.
Three main kinds of character set, each with its own flavours, are Windows, DOS, and Unicode.
Tip
To check results after changing the code-page, select Choose Texts and View the file in question. If you can't get it to look right, you've probably not got a cleaned-up plain text file but one straight from a word-processor. In that case, take it back into the word-processor (see here for how to do that in MS Word) and save it as text again as a plain text file in Unicode.
See also: Text Formats, Choosing Accents & Symbols, Accented characters, Choosing Language

11.6.2 accents & symbols
When entering your search-word you may need to insert symbols and accented characters into your search-word, exclusion word or context word, etc. If you have the right keyboard set for your version of Windows this may be very easy; if not, just choose the symbol in the main Controller by clicking.</p> <p><span class="badge badge-info text-white mr-2">436</span>
Below, you will see which character has been selected with the current font (which affects which characters can be seen). You can choose a number of characters and then paste them into Concord, by right-clicking and choosing from the popup-menu:
These options above show Greek, Hebrew, Thai and Bengali characters have been clicked. The last one ("Paste") is the regular Windows paste.
See also: Choosing Language, Change Language</p> <p><span class="badge badge-info text-white mr-2">437</span>
11.7 clipboard
You can block an area of data, by using the cursor arrows and Shift, or the mouse, then press Ctrl+Ins or Ctrl+C to copy it to the clipboard. If you then go to a word processor, you can paste (or "paste special") the blocked area into your text. This is usually easier than saving as a text file (or printing to a file) and can also handle any graphic marks.
Example
1. Select some data. Here I have selected 3 lines of a concordance, just the visible text, no Set or Filenames information.
2. Hold down Control and press Ins or C.
In the case of a concordance, since concordance lines are quite complex, you will be asked whether you want a picture of the selected screen lines, which looks like this in MS Word:
with the colours and font resembling those in WordSmith, and/or plain text, and if so how many characters:</p> <p><span class="badge badge-info text-white mr-2">438</span>
Once you've pressed OK, the data goes to the Windows "clipboard" ready for pasting into any other application, such as Excel, Word, Notepad, etc.
For all other types of lists, such as word-lists, the data are automatically placed in the Clipboard in both formats, as a picture and as text. You can choose either one and they will look quite different from each other! Choose "Paste Special" in Word or any other application to choose between these formats.
and then, for the picture format</p> <p><span class="badge badge-info text-white mr-2">439</span>
You will probably use this picture format for your dissertation and will have to in the case of plotted data.
In this concordance, you get only the words visible in your concordance line (not the whole line).
What you're pasting is a graphic which includes screen colours and graphic data. If you subsequently click on the graphic you will be able to alter the overall size of the graphic and edit each component word or graphic line (but not at all easily!). Note that if you select more lines than will subsequently fit on your page, MS Word may either shrink the image or only paste one pageful.
as plain text
Alternatively, you might want to paste as plain Unformatted Unicode text because you want to edit the concordance lines, e.g. for classroom use, or because you want to put it into a spreadsheet such as MS Excel™. Here the concordance or other data are copied as plain text, with a tab between each column. The Windows plain text editor Notepad can only handle this data format. Microsoft Word will paste (using Shift+Ins or Ctrl+V) the data as text. It pastes in as many characters as you have chosen above, the default being 60.
At first, the concordance lines are copied, but they don't line up very nicely. Use a non-proportional font, such as Courier or Lucida Console, and keep the number of characters per line down to some number like 60 or so -- then it'll look like this:
At 10 point text in Lucida Console, the width of the text with 60 characters and the numbers at the left comes to about 14 cm., as you can see.
To avoid word-wrapping, set the page format in Word to landscape, or keep the number of characters per line down to say 50 or 60 and the font size to 10.</p> <p><span class="badge badge-info text-white mr-2">440</span>
Avoid the heading and numbers in WordList or KeyWords too? See advanced clipboard settings.

11.8 contact addresses
Downloads
You can get a more recent version at our website. There are also some free extra downloads (programs, word lists, etc.) there too. And links to sources of free text corpora.
Screenshots
Visit http://lexically.net/wordsmith/support/get_started_guides.html for screenshots of what WordSmith Tools can do. This may give you useful ideas for your own research and will give you a better idea of the limitations of WordSmith too!
Purchase
Visit http://lexically.net/wordsmith/purchasing.htm for details of suppliers.
Complaints & Suggestions
Best of all, join Google Groups WordSmith Tools group and post your idea there so others can see the discussion. Or email me (mike (at) lexically.net). Please give me as full a description of the problem you need to tackle as you can, and details of the equipment too. Please don't include any attachments over 200K in size. I do try to help but cannot promise to...

11.9 date format
Date Format
Japanese date format: year_month_day_hour_minute. At least it is logical, going from larger to smaller. Why aren't URLs organised in a logical order too?

11.10 Definitions
11.10.1 definitions
valid characters
Valid characters include all the characters in the language you are working with which are defined (by Microsoft) as "letters", plus any user-defined acceptable characters to be included within a word (such as the apostrophe or hyphen). That is, in English, A, a,... Z, z will be valid characters but ; or @ or _ won't. In Greek, δ will count as a valid character. In Thai, ฏ (to patak) will be a valid character.
words
The word is defined as a sequence of valid characters with a word separator at each end.
A word can be of any length, but for one to be stored in a word list, you may set the length you prefer (maximum of 50 characters) -- any which exceed your limit will get + tagged onto them at that point. You can decide whether or not to include words including numbers (e.g. $35.50) in text characteristics.</p> <p><span class="badge badge-info text-white mr-2">441</span>
token and type
The term token is used to refer to running words and type to different words. So in This is my book, it is interesting we have 7 tokens but only 6 different types because is gets repeated.
clusters
A cluster is a group of words which follow each other in a text. The term phrase is not used here because it has technical senses in linguistics which would imply a grammatical relation between the words in it. In WordList cluster processing or Concord cluster processing there can be no certainty of this, though clusters often do match phrases or idioms. See also: general cluster information.
sentences
The sentence is defined as the full-stop, question-mark or exclamation-mark (.?!) (and equivalents in languages such as Arabic, Chinese, etc.) immediately followed by one or more word separators and then a number or a currency symbol, or a letter in the current language which isn't lower-case. Note: languages which do not distinguish between lower-case and upper-case characters do not technically count any as lower case or upper case. (For more discussion see Starts and Ends of Text Segments or Viewer & Aligner technical information.)
paragraphs
Paragraphs are user-defined. See Starts and Ends of Text Segments for further details.
headings
Headings are also user-defined -- see Starts and Ends of Text Segments.
texts
A text in WordSmith means what most non-linguists would call a text. In a newspaper, for example, there might be 6 or 7 "texts" on each page. This also means that a text = a file on disk. If it doesn't you're better off totally ignoring the "Texts" column in any WordSmith output.
chargrams
A chargram is a sequence of N consecutive valid characters (excluding digits and punctuation) found in text, e.g. ABI, ABL, ABO etc. In English the most frequent 3-chargrams are THE, ING, AND, ION.
See also: Setting Text Characteristics, Keyness, Key key-word, Associate
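The token, type and chargram definitions above can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not WordSmith's own routine: the function names are invented, valid characters are taken to be Unicode letters only (no user-defined apostrophes or hyphens), and a chargram which spans a whole word (such as WAS) is arbitrarily labelled word-initial because the manual does not say how that case is treated.

```python
import re
from collections import Counter

def tokens(text):
    """Running words: maximal runs of letters, lower-cased.

    Assumption: only letters are valid characters here, so digits,
    punctuation and underscores all act as separators.
    """
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]

def types(text):
    """Different words: the set of distinct tokens."""
    return set(tokens(text))

def chargrams(text, n=3):
    """Count (chargram, position) pairs, computed within each word."""
    counts = Counter()
    for word in tokens(text):
        w = word.upper()
        last = len(w) - n          # start index of the final chargram
        for i in range(last + 1):  # words shorter than n yield nothing
            if i == 0:
                position = "initial"   # also used for whole-word chargrams
            elif i == last:
                position = "final"
            else:
                position = "medial"
            counts[(w[i:i + n], position)] += 1
    return counts

sentence = "This is my book, it is interesting"
n_tokens = len(tokens(sentence))   # 7 running words
n_types = len(types(sentence))     # 6 different words

grams = chargrams("In 1845 there was a princess")
sequences = [gram for gram, _ in grams]
```

Run on the two examples used in this manual, the sketch gives 7 tokens and 6 types for the first sentence, and for the second it yields THE, HER, ERE, WAS, PRI, RIN, INC, NCE, CES, ESS, with THE word-initial, RIN medial and ESS word-final.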
</p> <p><span class="badge badge-info text-white mr-2">442</span>
11.10.2 word separators
Conventionally one assumes that one word is distinguished from the next by the presence of spaces at either end. But WordSmith Tools also includes within word separators certain standard codes used by most word processors: page eject code (12), tabs (9), carriage return (13) and line feed (10), end-of-text (26). Besides, hyphens may optionally be considered to split words like self-access into two words.
Note that in Chinese and Japanese which do not separate words in this way, any WordSmith functions which require word-separation will not work unless you get your texts previously tagged with word-separators.

11.11 demonstration version
The demonstration version of WordSmith Tools offers all the facilities of the complete suite, except that any screen which shows a list (of words in a word-list, or concordance lines, etc.) is limited to a small number of lines which can be shown or printed. (If you save data, all of it will be saved; it's just that you can't see it all in the demo version.)
See also: Installing, Contact Addresses, Version Information.

11.12 drag and drop
You can get WordSmith to compute some results simply by dragging.
If you have WordList open you can simply drag a text file onto it from Windows Explorer and it will create a word-list there and then using default settings. Or if it is not open, drag your text file to the WordList6.exe file. Here, Hamlet is being dragged onto the WordList tool.
If you have KeyWords open you can simply drag a text file onto it from Windows Explorer. If you have a valid word list set as the reference corpus, it will compute the key words. Or if it is not open, drag your text file to the KeyWords6.exe file, as in this screenshot where the Dickens novel Dombey and Son.txt is being dragged onto the KeyWords file.
If you drag a word-list made by WordList (.LST ending), a concordance (.CNC), a key word list (.KWS) etc. onto the Controller, it will open it with the appropriate tool.</p> <p><span class="badge badge-info text-white mr-2">443</span>
11.13 edit v. type-in mode
Most windows allow you to press keys either
· to edit your data (edit mode), or
· to get quickly to a place in a list (type-in mode).
Concordance windows use key presses also for setting categories for the data, or for blanking out the search word.
In type-in mode, your key-presses are supposed to help you get quickly to the list item you're interested in, e.g. by typing theocr to get to (or near to) theocracy in a word list. If you've typed in 5 letters and a match is found, the search stops.
Changing mode is done in the menu: Settings | Typing Mode:
See also: user-defined categories.

11.14 file extensions
The standard file-extensions used in WordSmith are
.cnc: concordance file
.lst: word list
.mut: mutual information list
.dcl: detailed consistency list</p> <p><span class="badge badge-info text-white mr-2">444</span>
.tokens, .types: word list index file
.kws: key words file
.kdb: key word database file
.base_pairs, .base_index_cg: WSConcgram files
.ali: aligner list
.vwr: viewer list
In the Controller's Main settings, or on installing, you can if you wish associate (or disassociate) the current file-types with WordSmith in the Registry. The advantage of association is that Windows will know what Tool to open your data files with.</p> <p><span class="badge badge-info text-white mr-2">445</span>
11.15 finding source texts
For some calculations the original source texts need to be available. For example, for Concord to show you more context than has been saved for each line, it'll need to re-read the source text.
For KeyWords to calculate a dispersion plot, it needs to look at the source text to find out which KWs came near each other and compute positions of each KW in the text and KW links.
If you have moved or deleted the source file(s) in the meantime, this won't be possible.
See also: Source texts, Editing filenames, Choosing source files, find files.
11.16 flavours of Unicode
What is Unicode?
What WordSmith requires for many languages (Russian, Japanese, Greek, Vietnamese, Arabic etc.) is Unicode. (Technically UTF16 Unicode, little-endian.) It uses 2 bytes for each character. One byte is not enough space to record complex characters, though it will work OK for the English alphabet and some simple punctuation and number characters.
UTF8, a format which was devised for many languages some years ago when disk space was limited and character encoding was problematic, is in widespread use but is generally not suitable. That's because it uses a variable number of bytes to represent the different characters. A to Z will be only 1 byte but for example Japanese characters may well need 2, 3 or even more bytes to represent one character.
</p> <p><span class="badge badge-info text-white mr-2">446</span>
There are a number of different "flavours" of Unicode as defined by the Unicode Consortium. MS Word offers
· Unicode
· Unicode (Big-Endian) (generated by some Mac or Unix software)
· Unicode (UTF-7)
· Unicode (UTF-8)
The last two are 1-byte versions, not really Unicode in my opinion. WordSmith wants the first of these but should automatically convert from any of the others. If you are converting text, prefer Unicode (little-endian), UTF16.
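If you want to see the difference between the flavours for yourself, or to prepare a file in the preferred form outside WordSmith, something like the following Python sketch may help. It is only an illustration, not part of WordSmith: the function name and the sample string are invented here, and it assumes the incoming text is UTF-8.

```python
import codecs


def to_utf16le(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode raw bytes and re-encode them as little-endian UTF-16.

    A byte-order mark (FF FE) is put at the front so that software
    reading the file can tell which flavour of Unicode it is.
    """
    text = raw.decode(source_encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Hypothetical sample: a Latin letter, an accented letter, two Japanese characters.
sample = "A é 日本"
utf8_bytes = sample.encode("utf-8")
utf16_bytes = to_utf16le(utf8_bytes)

# UTF-8 uses 1, 2 or 3 bytes for these characters; UTF-16 uses 2 for each.
print(len(sample), "characters")
print(len(utf8_bytes), "bytes as UTF-8")
print(len(utf16_bytes) - 2, "bytes as UTF-16 (not counting the 2-byte mark)")
```

Rarer characters outside the basic range take 4 bytes in UTF-16, so the "2 bytes each" rule of thumb holds for most but not all text.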
Technical Note
There are other flavours too and there is much more complexity to this topic than can be explained here, but essentially what we are trying to achieve is a system where a character can be stored in the PC in a fixed amount of space and displayed correctly.
Precomposed
In a few cases in certain languages, some of your texts may have been prepared with a character followed by an accent, such as A followed by ^, where the intention is for the software to display them merged (Â), instead of using precomposed characters where the two are merged in the text file. See the explanation in Advanced Settings if you need to handle that situation.
11.17 folders\directories
Found in main Settings menu in all Tools. Default folders can be altered in WordSmith Tools or set as defaults in wordsmith6.ini.
</p> <p><span class="badge badge-info text-white mr-2">447</span>
· Concordance Folder: for your concordance files.
· KeyWords Folder: for your key-word list files.
· WordList Folder: where you will usually save your word-list files.
· Aligner: for your dual-text aligned work.
· Texts Folder: where your text files are to be found.
· Downloaded Media: where your sound & video files will be stored after downloading the first time from the Internet.
· Settings: where your settings files (.ini files and some others) are kept.
If you write the name of a folder which doesn't exist, WordSmith Tools will create it for you if possible. (On a network, this will depend on whether you have rights to create folders and save files.)
If you change your Settings folder, you should let WordSmith copy any .ini and other settings files which have been created so that it can keep track of your language preferences, etc.
Note: in a network, drive letters such as G:, H:, K: change according to which machine you're running from, so that what is G:\texts\my text.txt on one terminal may be H:\texts\my text.txt on another.
Fortunately network drives also have names structured like this: \\computer_name\drive_name\. You will find that these names can be used by WordSmith, with the advantage that the same text files can be accessed again later.
If you run WordSmith from an external hard drive or a flash drive, where again the drive letter may change, you will find WordSmith arranges that if your folders are on that same drive they will change drive letter automatically once you have saved your defaults.
</p> <p><span class="badge badge-info text-white mr-2">448</span>
Tip
Use different folders for the different functions in WordSmith Tools. In particular, you may end up making a lot of word lists and key word lists if you're interested in making databases of key words. It is theoretically possible to put any number of files into a folder, but accessing them seems to slow down after there are more than about 500 in a folder. Use the batch facility to produce very large numbers of word list or key words files. I would recommend using a \keywords folder to store .kdb files, and \keywords\genre1, \keywords\genre2, etc. for the .kws files for each genre.
See also: finding source texts.
11.18 formulae
For computing collocation strength, we can use
· the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
· the frequency of word 1 altogether in the corpus
· the frequency of word 2 altogether in the corpus
· the span or horizons we consider for being neighbours
· the total number of running words in our corpus: total tokens
Mutual Information
Log to base 2 of (A divided by (B times C))
where
A = joint frequency divided by total tokens
B = frequency of word 1 divided by total tokens
C = frequency of word 2 divided by total tokens
MI3
Log to base 2 of ((J cubed) times E divided by B)
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)
B = (J + (total tokens-F1)) times (J + (total tokens-F2))
T Score
(J - ((F1 times F2) divided by total tokens)) divided by (square root of (J))
</p> <p><span class="badge badge-info text-white mr-2">449</span>
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
Z Score
(J - E) divided by the square root of (E times (1-P))
where
J = joint frequency
S = collocational span
F1 = frequency of word 1
F2 = frequency of word 2
P = F2 divided by (total tokens - F1)
E = P times F1 times S
Dice Coefficient
(J times 2) divided by (F1 + F2)
where
J = joint frequency
F1 = frequency of word 1 or corpus 1 word count
F2 = frequency of word 2 or corpus 2 word count
Ranges between 0 and 1.
Log Likelihood
based on Oakes p. 170-2.
2 times ( a Ln a + b Ln b + c Ln c + d Ln d - (a+b) Ln (a+b) - (a+c) Ln (a+c) - (b+d) Ln (b+d) - (c+d) Ln (c+d) + (a+b+c+d) Ln (a+b+c+d) )
where
a = joint frequency
b = frequency of word 1 - a
c = frequency of word 2 - a
d = frequency of pairs involving neither word 1 nor word 2
and "Ln" means Natural Logarithm
See also: Mutual Information, this link from Lancaster University
December, 2015.
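As a rough illustration of how four of the scores above (Mutual Information, T Score, Dice Coefficient and Log Likelihood) could be worked out from the counts, here is a minimal Python sketch. It is not WordSmith's own code: the function names and the sample counts are invented for this example, and the value used for d in Log Likelihood (total tokens - F1 - F2 + J) is an assumption, since d is only described in words above.

```python
import math


def mutual_information(j, f1, f2, total):
    """Log to base 2 of (A / (B * C)), with A, B and C as defined above."""
    a = j / total
    b = f1 / total
    c = f2 / total
    return math.log2(a / (b * c))


def t_score(j, f1, f2, total):
    """(J - (F1 * F2) / total tokens) divided by the square root of J."""
    return (j - (f1 * f2) / total) / math.sqrt(j)


def dice(j, f1, f2):
    """(J * 2) divided by (F1 + F2); ranges between 0 and 1."""
    return (j * 2) / (f1 + f2)


def _x_ln_x(x):
    """x * Ln(x), treated as 0 when x is 0 (the usual convention)."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(j, f1, f2, total):
    """The Log Likelihood expression above, over the four cells a, b, c, d."""
    a = j
    b = f1 - a
    c = f2 - a
    d = total - f1 - f2 + a  # assumption: everything involving neither word
    return 2 * (
        _x_ln_x(a) + _x_ln_x(b) + _x_ln_x(c) + _x_ln_x(d)
        - _x_ln_x(a + b) - _x_ln_x(a + c)
        - _x_ln_x(b + d) - _x_ln_x(c + d)
        + _x_ln_x(a + b + c + d)
    )


# Hypothetical counts: two words co-occur 10 times in 100,000 tokens.
J, F1, F2, TOTAL = 10, 100, 50, 100_000
print("MI     ", round(mutual_information(J, F1, F2, TOTAL), 3))
print("T score", round(t_score(J, F1, F2, TOTAL), 3))
print("Dice   ", round(dice(J, F1, F2), 3))
print("LL     ", round(log_likelihood(J, F1, F2, TOTAL), 3))
```

MI3 and Z Score are left out to keep the sketch short; they would follow the same pattern from the definitions given above.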
Page 434</p> <p><span class="badge badge-info text-white mr-2">450</span>
11.19 history list
History List: many of the combo-boxes in WordSmith like this one for choosing a search-word remember what you type in so you can look them up by pressing the down arrow at the right.
11.20 HTML, SGML and XML
These are formats for text exchange. The most well known is HTML, Hypertext Markup Language, used for distributing texts via the Internet. SGML is Standard Generalized Markup Language, used by publishers and the BNC; XML is Extensible Markup Language, intermediate between the other two.
All these standards use plain text with additional extra tags, mostly angle-bracketed, such as <h1> and </h1>. The point of inserting these tags is to add extra sorts of information to the text:
1 a header (<head>) supplying details of the authorship & edition
2 how it should display (e.g. <bold>, <italics>)
3 what the important sections are (<h1> marks a heading, <body> is the body of the text)
4 how special symbols should display (&eacute; corresponds to é)
See also: Overview of Tags
11.21 hyphens
The character used to separate words. The item "self-help" can be considered as 2 words or 1 word, depending on Language Settings.
</p> <p><span class="badge badge-info text-white mr-2">451</span>
11.22 international versions
WordSmith can operate with a series of interfaces depending on the language chosen. If you choose French this is what you see in all of WordSmith.
See also: acknowledgements
</p> <p><span class="badge badge-info text-white mr-2">452</span>
11.23 limitations
The programs in WordSmith Tools can handle virtually unlimited amounts of text. They can read text from CD-ROMs, so giving access to corpora containing many millions of words. In practice, the limits are reached by a) storage and b) patience.
You can have as many copies of each Tool running at any one time as you like. Each one allows you to work on one set of data.
Tags to ignore or ones containing an asterisk can span up to 1,000 characters.
When searching for tags to determine whether your text files meet certain requirements, only the first 2 megabytes of text are examined. For Ascii that's 2 million characters, for Unicode 1 million.
Tip
Press F9 to see the "About" box -- it shows the version date and how much memory you have available. If you have too little memory left, try a) closing down some applications, b) closing WordSmith Tools and re-entering.
See also: Specific Limitations of each Tool
11.23.1 tool-specific limitations
Concord limitations
You can compute a virtually unlimited number of lines of concordance using Concord.
Concord allows 80 characters for your search-word or phrase, though you can specify an unlimited number of concordance search-words in a search-word file.
Each concordance can store an unlimited number of collocates with a maximum horizon of 25 words to left and right of your search-word.
WordList limitations
A head entry can hold thousands of lemmas, but you can only join up to 20 items in one go using F4. Repeat as needed.
Detailed Consistency lists can handle up to 50 files.
KeyWords limitations
One key-word plot per key-word display. (If you want more, call up the same file in a new display window.)
number of link-windows per key-word plot display: 20.
number of windows of associates per key key-word display: 20.
File Utilities: Splitter limitations
Each line of a large text file can be up to 10,000 characters in length. That is, there must be an <Enter> from time to time!
</p> <p><span class="badge badge-info text-white mr-2">453</span>
Text Converter limitations
There can be up to 500 strings to search-and-replace for each.
Each search-string and each replace-string can be up to 80 characters long. An asterisk must not be the first or last character of the search-string. When the asterisk is used to retain information, the limit is 1,000 characters.
Viewer & Aligner limitations
If you choose the View option when choosing texts, Viewer & Aligner will call up the first 10 source text files selected. When choosing texts or jumping into the middle of a text (e.g. after choosing in Concord), Viewer & Aligner will only process 10,000 characters of each file, to speed things up in the case of very large files, but you can get it to "re-read" the file by pressing F5 to refresh the display, after which it will read the whole text.
See also: General Limitations
11.24 links between tools
Linkage with Word Processors, Spreadsheets etc.
All the windows showing lists or texts can easily copy selected information to the clipboard. (Use Ctrl+Ins or Ctrl+C to copy.) Where you see this symbol, you can send any selected data straight to a new Microsoft Word™ document. Where you see a URL (such as http://lexically.net ) you can click to access your browser.
Links between the various Tools
The programs in WordSmith Tools are linked to each other via wordsmith.exe (the one which says "WordSmith Tools Controller" in its caption, and is found in the top-left corner of your screen). This handles all the defaults, such as colours, folders, fonts, stop lists, etc.
In general, if you press Ctrl+C in KeyWords or WordList you'll go straight to a concordance, computed using the current word and using the current files.
Each Tool will send as much relevant information as possible to the Tool being called. This will include: the current word (the one highlighted in the scrolling window) and the text files where any current information came from.
Example: after computing a word list based on 3 business texts, you discover that the word hopeful is more frequent than you had expected.
You want to do a concordance on that word, using the same texts. Place the highlight on hopeful, hold down Control and press C. Now you can see whether hopeful is part of a 3-word cluster, or view a dispersion plot.
</p> <p><span class="badge badge-info text-white mr-2">454</span>
Example: after computing a key words database using 300 business texts, you discover that the word bid seems to be a key key-word, and that it's associated with company, shares etc. Place the highlight on bid, press Control-C and a concordance will be computed using the same 300 texts. Now you can check out the contexts: is bid a bid for power, or is it part of a tendering process?
Example: you have a concordance of green. Now press Control-W to generate a word list of the same text files. Press Control-K to compare this word list with a reference corpus list to see what the key words are in these text files.
11.25 keyboard shortcuts
scrolling windows:
Control+Home -- to top of scrollable list
Control+End -- to end of scrollable list
if it's ordered alphabetically: type-in your search-word
and if it scrolls horizontally: Home -- to left edge, End -- to right edge
hotkeys:
Shift-cursor keys -- block a section
F1 -- help
Ctrl+F2 -- save results
F3 -- print preview
Ctrl+P -- print results
F4 -- join entries
Ctrl+F4 -- unjoin
Alt+F5 -- mark entries for joining
Shift+Alt+F5 -- unmark entry
F5 -- refresh a list
Shift+Ctrl+F8 -- auto set row height in Concord
F6 -- re-sort
Ctrl+F6 -- reverse word sort
Shift+Ctrl+F6 -- word-length sort
F7 -- view source text
F8 -- grow line height
Ctrl+F8 -- shrink
F9 -- About box (shows version-date and memory availability)
F10 -- compute collocates
F11 -- choose texts
Ctrl+Shift+C -- compute concordance
Ctrl+C -- copy
December, 2015.
Page 439</p> <p><span class="badge badge-info text-white mr-2">455</span>
Ctrl+F3 -- find again
Alt+D -- find next deleted entry
Ctrl+L -- layout & columns of data
Ctrl+M -- play media file
Ctrl+N -- new
Ctrl+U -- undo
Ctrl+V -- paste
Ctrl+W -- close
Alt+X -- eXit the Tool
Ctrl+Z -- Zap deleted lines
Del -- delete
Numeric - -- delete to the end
Ins -- restore deleted entry
Numeric + -- restore to the end
see also: Menu items and Buttons
11.26 machine requirements
This version of WordSmith Tools is designed for machines with:
· at least 1GB of RAM
· at least 200MB of hard disk space
· Windows™ XP or later, or an emulator of one of these if using an Apple Mac or Unix system.
You will find it runs better on a faster machine, especially if there's plenty of RAM.
You can run WordSmith from a memory stick on a fast computer better than on a slow computer. (You can run WordSmith on a tiny 10" screen laptop with Windows Starter and little power but all applications on those are slow and there is not much screen for your results.)
There is no Apple Mac version but see http://lexically.net/wordsmith/mac_intel.htm for details on how to use WordSmith on a Mac.
11.27 manual for WordSmith Tools
This help file exists in the form of a manual, which you get when you install. The file (wordsmith.pdf) is in Adobe Acrobat™ format. It has a table of contents and a fairly detailed index (which I used WordList and KeyWords to help me create). Most people find paper easier to deal with than help files!
You may find it useful to see screenshots of WordSmith in action: ideas are listed here.
</p> <p><span class="badge badge-info text-white mr-2">456</span>
11.28 menu and button options
These functions may or may not be visible in each Tool depending on the capacity of the Tool or the current window of data -- the one whose caption bar is highlighted.
advanced
allows access to advanced features
associates
opens a new window showing Associates.
auto-join
joins (lemmatises) automatically.
auto-size
re-sizes each line of a display so that each one shows as much data as it should. Most windows have lines of a fixed size but some, e.g. in Viewer, allow you to adjust row heights. This adjusts line heights according to the current highlighted column of data.
close (Ctrl+W)
closes a window of data
clumps
computes clumps in a keywords database
regroup clumps
regroups the clumps
clusters
computes concordance clusters.
collocates
shows collocates using concordance data.
compute
calculates a new column of data based on calculator functions and/or existing data.
redo collocates
recalculates collocates, e.g. after you've deleted concordance lines.
column totals
computes totals, min, max, mean, standard deviation for each column of numerical data.
concordance (Shift+Ctrl+C)
within KeyWords, WordList, starts Concord and concordances the highlighted word(s) using the original source text(s).
copy (Ctrl+C)
allows you to copy your data to a variety of different places (the printer, a text file, the clipboard, etc.).
edit
allows editing of a list or searches for a word (type-in search).
exit (Alt+X)
</p> <p><span class="badge badge-info text-white mr-2">457</span>
quits a Tool.
edit or type-in mode
alternates between edit and type-in mode.
filenames
opens a new window showing the file names from which the current data derived. If necessary you can edit them.
find files
finds any text files which contain all the words you've marked.
grow
increases the height of all rows to a fixed size. See shrink below.
help (F1)
opens WordSmith Help (this file) with context-sensitive help.
join
joins one entry to another e.g.
sentences in Viewer, words in WordList (lemmatisation).
layout
This allows you to alter many settings for the layout: the colour of each column, whether to hide a column of data, typefaces and column widths.
links
computes links between words in a key-words plot.
mark
marks an entry for joining or finding files.
match lemmas
checks each item in the list against ones from a text file of lemmatised forms and joins any that match.
match list
matches up the entries in the current list against ones in a "match list file" or template, marking any found with (~).
relation
computes mutual information or similar scores in a WordList index list.
new... (Ctrl+N)
gets you started in the various Tools, e.g. to make a concordance, a word list, or a key words list.
open... (Ctrl+O)
gives you a chance to choose a set of saved results.
patterns
computes collocation patterns.
play media (Ctrl+M)
plays a media file.
plot
opens a new window showing a Concord dispersion plot or KeyWords plot.
</p> <p><span class="badge badge-info text-white mr-2">458</span>
print preview (F3)
previews your window data for printing (Ctrl+P); can print to file, which is equivalent to "save as text".
redo
undoes an undo.
refresh (F5)
re-draws the screen (in Viewer re-reads your text file).
remove duplicates
removes any duplicate concordance lines.
replace
search & replace, e.g. to replace drive or folder data, when editing file-names where the source texts have been moved.
re-sort
re-sorts lists (e.g. in frequency as opposed to alphabetical order) in Concord, KeyWords or WordList.
ruler
shows/hides vertical divisions in any list; text divisions in a KeyWords plot. Click ruler in a menu to turn on or off or change the number of ruler divisions for a plot.
save (also Ctrl+F2)
saves your data using existing file-name; if it's a new file asks for file-name first.
save as
saves after asking you for a file-name.
save as text
saves as a .txt file: plain text.
search
searches within a list.
shrink
reduces the height of all rows to a smaller fixed height. See grow above.
statistics
shows detailed statistics.
statusbar
toggles on & off the "status bar" (at the bottom of a window, shows comments and the status of what has been done).
summary statistics
opens a new window showing summary statistics, e.g. proportion of lemmas to word-types.
toolbar
toggles on & off a toolbar with the same buttons on it as the ones you chose when you customised popup menus.
undo (Ctrl+U)
undoes last operation.
</p> <p><span class="badge badge-info text-white mr-2">459</span>
unjoin
unjoins any entries that have been joined, e.g. lemmatised entries.
view source text
shows the source text and highlights any words currently selected in the list.
Microsoft Excel or Word™
save formatted data for Excel or Word.
wordlist
within KeyWords, makes a word list using the current data.
zap (Ctrl+Z)
zaps any deleted entries.
see also: Keyboard Shortcuts, Customising popup menus.
11.29 MS Word documents
Inside a .doc or .docx file there is a lot of extra coding apart from the plain text words. (Actually, a .docx doesn't even seem to show the ordinary text words inside it!) For example, the name of your printer, the owner of the software, information about styles etc. For accurate results, WordSmith needs to use clean text where these have been removed.
converting your .DOC or .DOCX files
The easiest method, for multiple .doc or .docx files, is to convert using the Text Converter.
Alternatively you can do it in Word.
To convert a .doc or .docx into plain text in Word can be done thus:
Choose File | Save As | Plain text:
then choose Windows (1-byte per character)
December, 2015.
Page 444</p> <p><span class="badge badge-info text-white mr-2">460</span>
or Other encoding -- Unicode (2-bytes):
</p> <p><span class="badge badge-info text-white mr-2">461</span>
11.30 never used WordSmith before
For users who are starting out with WordSmith for the first time, the whole process can seem complex. (After all, the first time you used word-processing software that seemed tricky -- but you already knew what a text is and how to write one...)
So a small text file accompanies the WordSmith installation, and if WordSmith thinks you have never used it before, it will automatically choose that text file for you to start using Concord, WordList etc.
WordSmith decides whether you are a new user by checking
1) whether any concordances or word lists have been saved, and
2) whether any set of favourite text files has been saved for easy retrieval.
If neither has happened, it treats you as a new user.
11.31 numbers
Depending on Language and Text Settings, you might wish to include or exclude numbers from word lists.
11.32 plot dispersion value
The point of it
A dispersion value is the degree to which a set of values are uniformly spread. Think of rainfall in the UK -- generally fairly uniformly spread throughout the year. Compare with countries which have a rainy season.
In linguistic terms, one might wish to know how the occurrences of a word like skull are distributed in Hamlet, and WordSmith has shown this in plot form since version 1. The dispersion value statistic gives mathematical support to this and makes comparisons easier.
How it is calculated
The plot dispersion calculated in KeyWords and Concord dispersion plots uses the first of the 3 formulae supplied in Oakes (1998: 190-191), which he reports as having been evaluated as the most reliable. Like the ruler, it divides the plot into 8 segments for this.
It ranges from 0 to 1, with 0.9 or 1 suggesting very uniform dispersion and 0 or 0.1 suggesting "burstiness" (Katz, 1996).
See also: KeyWords plot, Concord dispersion plot.
</p> <p><span class="badge badge-info text-white mr-2">462</span>
11.33 RAM availability
The more RAM (chip memory) you have in your computer, the faster it will run and the more it can store. As it is working, each program needs to store results in memory. A word list of over 80,000 entries, representing over 4 million words of text, will take up roughly 3 Megabytes of memory. (In Finnish it would be much more.)
When memory is low, Windows will attempt to find room by putting some results in temporary storage on your hard disk. If this happens, you'll probably hear a lot of clicking as it puts data onto the disk and then reads it off again. You will probably hear some clicking anyway as most of the programs in WordSmith Tools access your original texts from the hard disk, but a constant barrage of thrashing shows you've reached your machine's natural limits.
You can find out how much storage you have available even in the middle of a process, by pressing F9 (the About option in the main Help menu of each program). The first line states the RAM availability. The other figures supplied concern Windows system resources: they should not be a problem but if they do go below about 20% you should save results, exit Windows and re-enter.
Theoretically, word lists and key word lists can contain up to 2,147,483,647 separate entries. Each of these words can have appeared in your texts up to 2,147,483,647 times. (This strange number 2,147,483,647, one less than half of 2 to the power 32, is the largest signed integer which can be stored in 32 bits and is also called 2 Gigabytes.)
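If you would like to check where that strange number comes from, the arithmetic can be done in a couple of lines of Python. This is just a worked example with names invented here; it has nothing to do with WordSmith's internals.

```python
# A signed 32-bit integer keeps one bit for the sign, so the largest
# positive value is one less than half of 2 to the power 32.
BITS = 32
half_of_2_to_the_32 = 2 ** BITS // 2        # the same as 2 to the power 31
largest_signed_32bit = half_of_2_to_the_32 - 1

print(f"{largest_signed_32bit:,}")
```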
You are not likely to reach this theoretical limit: for the item the to have occurred 2,147,483,647 times in your texts, you would have processed about 30 thousand million words. (1 CD-ROM, containing only plain text, can hold about 100 million words so this number represents some 300 CD-ROMs.) You would have run out of RAM long before this.
If you have a Gigabyte of RAM or more you should be able to have a copy of a word-list based on millions of words of text, and at the same time have a powerful word-processor and a text file in memory.
See also: speed
11.34 reference corpus
Reference Corpus
A corpus of text which you use for comparative purposes. For example, you might want to compare a given piece of text with the British National Corpus, a collection of 100 million words. Useful when computing key words.
In the Controller you can set your reference corpus word list for KeyWords and Concord to make use of. (That is, a word list created using the WordList tool.)
11.35 restore last file
By default, the last word list, concordance or key words listing that you saved or retrieved will be automatically restored on entry to WordSmith Tools. If the last Tool used is Concord, a list of your 10 most recent search-words will be saved too.
This feature can be turned off temporarily via a menu option or permanently in wordsmith6.ini (in your Documents\wsmith6 folder).
</p> <p><span class="badge badge-info text-white mr-2">463</span>
11.36 single words v. clusters
The point of it...
Clusters are words which are found repeatedly together in each others' company, in sequence. They represent a tighter relationship than collocates, more like multi-word units or groups or phrases. (I call them clusters because groups and phrases already have uses in grammar and because simply being found together in software doesn't guarantee they are true multi-word units.)
Biber calls clusters, if repeated the right ways, "lexical bundles".
Language is phrasal and textual. It is not helpful to see it as a matter of selecting a word to fill a grammatical "slot" as implied by structural theories. Words keep company: the extreme example is idiom where they're bound tightly to each other, but all words have a tendency to cluster together with some others. These clustering relations may involve colligation (e.g. the relationship between depend and on), collocation, and semantic prosody (the tendency for cause to come with negative effects such as accident, trouble, etc.).
WordSmith Tools gives you two opportunities for identifying word clusters, in WordList and Concord. They use different methods. Concord only processes concordance lines, while WordList processes whole texts.
How they are computed ...
Suppose your text begins like this:
Once upon a time, there was a beautiful princess. She snored. But the prince didn't.
If you've chosen 2-word clusters, the text will be split up as follows:
Once upon
upon a
a time
(note not "time there" because of the comma)
there was
(etc.)
With a three-word cluster setting, it would send
Once upon a
upon a time
there was a
was a beautiful
a beautiful princess
But the prince
the prince didn't
(etc.)
That is, each n-word cluster will be stored, if it reaches n words in length, up to a punctuation boundary, marked by ;,.!? (It seems reasonable to suppose that a cluster does not cross clause boundaries and these punctuation symbols help mark clause boundaries, but there is a Concord setting or a WordList setting for this to give you choice.)
See also: concgrams.
11.37 speed
networks
If you're working on a network, WordSmith will be s-l-o-w if it has to read and write results across the network.
</p> <p><span class="badge badge-info text-white mr-2">464</span>
It's much faster to do your work locally on a C:\ or D:\ drive and then copy any useful results over to network storage later if required.
and generally
To make a word-list on 4.2 million words used to take about 20 minutes on a 1993 vintage 486-33 with 8MB of RAM. The sorting procedure at the end of the processing took about 30 seconds. A 200MHz Pentium with 64MB of RAM handled over 1.7 million words per minute. On a 100MHz Pentium with 32MB of RAM this whole process took about 3 and a half minutes, working at over a million words a minute.
When concordancing, tests on the same Pentium 100, using one 55MB text file of 9.3 million words, and a quad-speed CD-ROM drive, showed
search-word    source       speed
quickly        CD-ROM       6 million words per minute
quickly        hard disk    12 million wpm
the            CD-ROM       900,000 wpm
the            hard disk    1 million wpm
thez           CD-ROM       6 million wpm
thez           hard disk    16 million wpm
Tests using a set of text files ranging from 20K down to 4K, using quickly as the search-word, gave speeds of 2 million wpm rising with the longer files to 4 million wpm.
Making a word list on the same set of files gave an average speed of 800,000 wpm. On the 55MB text file the speed was around 1.35 million wpm.
These data suggest that factors which slow concordancing down are, in order, word rarity (the was much slower than quickly or the non-existent thez), text file size (very small files of only 500 words or so (3K) will be processed about three times as slowly as big ones) and disk speed (the outdated quad speed CD-ROM being roughly half the speed of the 12ms hard disk). When Concord finds a word it has to store the concordance line and collocates and show it (so that you can decide to suspend any further processing if you don't like the results or have enough already). This is a major factor slowing down the processing.
Second, reading a file calls on the computer's file management system, which is quite slow in loading it, in comparison with Concord actually searching through it. Third, disk speeds are quite varied, floppy disks being much the worst for speed.
If processing seems excessively slow, close down as many programs as possible and run WordSmith Tools again. Or install more RAM. Get advice about setting Windows to run efficiently (virtual memory, disk caches, etc.). Use a large fast hard drive.
You can run other software while the programs are computing, but they will take up a lot of the processor's time. Shoot-em-up games may run too jerkily, but printing a document at the same time should be fine.
11.38 status bar
The bar at the bottom of a window, which allows you to pull the whole window bigger or smaller, and which also shows a series of panels with information on the current data. The status bar can usually be revealed or hidden using a main menu option. You can right-click on the panel to bring up a popup menu offering choice between Edit, Type and Set.
</p> <p><span class="badge badge-info text-white mr-2">465</span>
11.39 tools for pattern-spotting
Tools are needed in almost every human endeavour, from making pottery to predicting the weather. Computer tools are useful because they enable certain actions to be performed easily, and this facility means that it becomes possible to do more complex jobs. It becomes possible to gain insights because when you can try an idea out quickly and easily, you can experiment, and from experimentation comes insight. Also, re-casting a set of data in a new form enables the human being to spot patterns.
This is ironic. The computer is an awful device for recognising patterns. It is good at addition, sorting, etc. It has a memory but it does not know or understand anything, and for a computer to recognise printed characters, never mind reading hand-writing, is a major accomplishment.
Nevertheless, the computer is a good device for helping humans to spot patterns and trends. That is why it is important to see computer tools such as these in WordSmith Tools in their true light. A tool helps you to do your job; it doesn't do your job for you.

Tool versus Product
Some software is designed as a product. A game is self-contained, so is an electronic dictionary. A word-processor, spreadsheet or database, on the other hand, is a tool because it goes beyond its own borders: you use it to achieve something which the manufacturers could not possibly anticipate. WordSmith Tools, as the name states, are not products but tools. You can use them to investigate many kinds of pattern in virtually any texts written in a good range of different languages.

Insight through Transformation
No, this is not a religious claim! The claim I am making is psychological. It is through changing the shape of data, reducing it and then re-casting it in a different format, that the human capacity for noticing patterns comes to the fore. The computer cannot "notice" at all (if you input 2 into a calculator and then keep asking it to double it, it will not notice what you're up to and begin to do it automatically!). Human beings are good at noticing, and particularly good at noticing visual patterns. By transforming a text into a list, or by plotting keywords in terms of where they crop up in their source texts, the human user will tend to see a pattern. Indeed we cannot help it. Sometimes we see patterns where none was intended (e.g. in a cloud). There can be no guarantee that the pattern is "really there": it's all in the mind of the beholder. WordSmith Tools are intended to help this process of pattern-spotting, which leads to insight. The tools in this kit are intended therefore to help you gain your own insights on your own data from your own texts.

Types of Tool
All tools take up positions on two scales: the scale of specialisation and the scale of permanence.
general-purpose ----------------- specialised

general-purpose
The spade is a digging tool which makes cutting and lifting soil easier than it otherwise would be. But it can also be used for shovelling sand or clearing snow. A sewing machine can be used to make curtains or handkerchiefs. A word-processor is general-purpose.
</p> <p><span class="badge badge-info text-white mr-2">466</span>
specialised
A thimble is dedicated to the purpose of protecting the fingers when sewing and is rarely used for anything else. An overlock device is dedicated to sewing button-holes and hems: it's better at that job than a sewing machine but its applications are specialised. A spell-checker within a word-processor is fairly specialised.

temporary ----------------- permanent

temporary
The branch a gorilla uses to pull down fruit is a temporary tool. After use it reverts to being a spare piece of tree. A plank used as a tool for smoothing concrete is similar. It doesn't get labelled as a tool though it is used as one. This kind of makeshift tool is called "quebra-galho", literally branch-breaker, in Brazilian Portuguese.

permanent
A chisel is manufactured, catalogued and sold as a permanent tool. It has a formal label in our vocabulary. Once bought, it takes up storage room and needs to be kept in good condition.

The WordSmith Tools in this kit originated from temporary tools and have become permanent. They are intended to be general-purpose tools: this is the Swiss Army knife for lexis. They won't cut your fingers but you do need to know how to use them.
See also: Acknowledgements, Word Clouds, Dispersion Plots

11.40 version information
This help file is for the current version of WordSmith Tools.
The version of WordSmith Tools is displayed in the About option (F9) which also shows your registered name and the available amount of memory.
If you have a demonstration version this will be stated immediately below your name. Check the date in this box, which will tell you how up-to-date your current version is. As suggestions are incorporated, improved versions are made available for downloading. Keep a copy of your registration code for updated versions. You can click on the WordSmith graphic in the About box to see your current code.
</p> <p><span class="badge badge-info text-white mr-2">467</span>
See also: 32-bit Version Differences, Demonstration Version, Contact Addresses

11.40.1 Version 3 improvements
After the earlier 16-bit versions of the 1990s, WordSmith brought in lots of changes "under the hood".
· long file names
· better tag and entity handling including Tag Concordancing
· converter for previous data
· zip file handling
· easier exporting of data to Microsoft Word and Excel
· Unicode text handling, allowing more languages to be processed
· possibility of altering the data as it comes in, e.g. for language-specific lemmatisation
· the old limitations of 16,000 lines of data went. (The theoretical limit for a list of data is over 134 million lines.)
See also: What's New in the current version, Contact Addresses
</p> <p><span class="badge badge-info text-white mr-2">468</span>
11.41 zip files
Zip files are files which have been compressed in a standard way. WordSmith can now read and write to .zip files.

The point of it...
Apart from the obvious advantage of your files being considerably smaller than the originals were, the other advantage is that less disk space gets wasted. The waste arises like this: any text file, even a short one containing only the word "hello", will take up on your disk something like 4,000 bytes or maybe up to 32,000 depending on your system. If you have 100 short files, you would be losing many thousands of bytes of space.
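The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, assuming a 4,096-byte allocation unit (the real figure depends on your system); the names `ALLOCATION_UNIT`, `space_on_disk` and `zipped_size` are invented for the example and are not part of WordSmith Tools.

```python
import io
import zipfile

ALLOCATION_UNIT = 4096  # assumed size of one disk allocation unit, in bytes


def space_on_disk(size_in_bytes, unit=ALLOCATION_UNIT):
    """Bytes occupied on disk: the size rounded up to whole allocation units."""
    return -(-size_in_bytes // unit) * unit  # ceiling division


def zipped_size(texts):
    """Size in bytes of a single zip archive holding all the given texts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, text in enumerate(texts):
            archive.writestr(f"file{number:03d}.txt", text)
    return len(buffer.getvalue())


short_texts = ["hello"] * 100
# Space used when every short text is stored as its own file.
separate = sum(space_on_disk(len(text.encode("utf-8"))) for text in short_texts)
# Space used when all of them sit inside one zip file.
together = space_on_disk(zipped_size(short_texts))
print("stored separately:", separate, "bytes")
print("stored in one zip:", together, "bytes")
```

The exact figures will differ from one system to another, but the separate files always cost at least one allocation unit each, whereas the zip file pays that rounding-up cost only once.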
If you "zip" 100 short files they may fit into just 1 such space. Zip files are used a lot in Internet transmissions because of these advantages. If you have a lot of word lists to store, it will be much more efficient to store them in one .zip file.
The "cost" of zipping is a) the very small amount of time this takes, b) the resulting .zip file can only be read by software which understands the standard format. There are numerous zip programs on the market, including PKZip™ and Winzip™. If you zip up a word list, these programs can unzip it but won't be able to do anything with the finished list. WordSmith can first unzip it and then show it to you.

How to do it...
Where you see an option to create a zip file, this can be checked, and the results will be stored where you choose but in zipped form with the .zip ending.
If you choose to open a zipped word list, concordance, text file, etc. and it contains more than one file within it, you will get a chance to decide which file(s) within it to open up. Otherwise the process will happen in the background and will not affect your normal WordSmith processing.
</p> <p><span class="badge badge-info text-white mr-2">469</span> Troubleshooting Section XII</p> <p><span class="badge badge-info text-white mr-2">470</span>
12 Troubleshooting

12.1 list of FAQs
See also: logging.
These are the Frequently Asked Questions. There's a much longer list of explanations under Error Messages.
Can't process apostrophes
Is this Russian, Greek or English? strange symbols in display
It crashed
It doesn't even start!
It takes ages!
Keys don't respond
Line beyond demo limit
Mismatch between Concord and WordList results
No tags visible in concordance
Printing problem
Text is unreadable because of the colours
Too much or too little space between columns
Wordlist out of order
Won't slice pineapples

12.2 apostrophes not found
Apostrophes not processed
If your original text files were saved using Microsoft Word™, you may find Concord can't find apostrophes or quotation marks in them! This is because Word can be set to produce "smart" symbols. The ordinary apostrophe or inverted comma in this case will be replaced by a curly one, curling left or right depending on its position on the left or right of a word. These smart symbols are not the same as straight apostrophes or double quote symbols.
Solution: select the symbol in the character set in the Controller, then paste when entering your search word, or else replace them in your text files using Text Converter.
See also: settings

12.3 column spacing
column spacing is wrong
You can alter this by clicking on the layout button.

12.4 Concord tags problem
no tags visible in concordance
If you can't see any tags after asking for Nearest Tag in Concord, it is probably because the Text to Ignore has the same format. For example, if Tags to Ignore has <*>, any tags such as <title>, <quote>, etc. will be cut out of the concordance unless you specify them in a tag file.
Solution: specify the tag file and run the concordance again.
</p> <p><span class="badge badge-info text-white mr-2">471</span>
12.5 Concord/WordList mismatch
Concord/WordList mismatch
If WordList finds a certain number of occurrences of a (word list) cluster but Concord finds a different number, this is because the procedures are different. WordList proceeds word by word, ignoring punctuation (except for hyphens and apostrophes). When Concord searches for a (concordance) cluster it will (by default) take punctuation into account: you can change that in the settings if you wish.

12.6 crashed
it crashed!
Solution: quit WordSmith Tools and enter again. If that fails, quit Windows and try again.
Or try logging. The idea of Logging is to find out what is causing a crash. It is designed for when WS gets only part of the way through some process. As it proceeds, it keeps adding messages to the log about what it has found and done. When it crashes, it can't add any more messages! So if you examine the log you can see where it was up to. At that point, you may see a text file name that it opened up. Examine that text: you might be able to see something strange about it, e.g. it has got corrupted.

12.7 demo limit
demo limit reached
You may have just downloaded, but you haven't yet supplied your registration details. To do this, go to the main WordSmith Tools window, and choose Settings | Register in the menu.
If you haven't got the registration code, contact Lexical Analysis Software (sales@lexically.net).
The only difference between a demonstration version and a full version is: with the latter you can see or print all the data, with the former you'll be able to see only about 25 lines of output.

12.8 funny symbols
weird symbols
funny symbols when using WordSmith Tools
1. Check your text files. Look at them in Notepad. Do they contain lots of strange symbols? These may be hidden codes used by your usual word-processor. Solution: open them in your usual word-processor and Save As, with a new name, in plain text format, sometimes called "Text Only" or .txt. In Word 2003, pick the plain text option and then choose Unicode.
</p> <p><span class="badge badge-info text-white mr-2">472</span>
2. Choose Texts, select the text file(s), right-click and View. Does it contain strange symbols?
3. Use Text Converter to clean up and convert your text files to Unicode.
Greek, Russian, etc.
4. If the text is in Russian, Greek, etc. you will need an appropriate font, obtainable from your Windows cd or via the Microsoft website.
5. If you have several lists open which use different character sets, and you change Font or Text Characteristics, the lists will all be updated to show the current font and character set, unless you first minimize any window which would be affected.

funny symbols when reading WordSmith data in another application
WordSmith Tools can Save or Save As, and it saves as text by printing to a file. "Save" and "Save As" will store the file in a format for re-use by WordSmith. This format is not suitable for reading into a word processor. The idea is simply for you to store your work so that you can return to it another day.
"Save as Text", on the other hand, means saving as plain text, by "printing" to a file. This function is useful if you don't want to print to paper from WordSmith but instead take the data into a spreadsheet, or word processor such as Microsoft Word. It is usually quicker to copy the selected text into the clipboard.

12.9 illegible colours
text unreadable because of colours
Solution: in Settings, choose Colours. You can now set the colours which suit your computer monitor. Monochrome settings are available.

12.10 keys don't respond
Keys don't respond
If a key press does nothing, it is probably because the wrong window, or the wrong column in the window, has the focus. As you know, Windows is designed to let users open up a number of programs at once on the same screen, so each window will respond to different key-press combinations. You can see which window has the focus because its caption is coloured differently from all the others. The solution is to click within the appropriate window/column, then press the key you wanted.
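Returning to the symbol problems in 12.2 and 12.8: for readers who prefer to tidy their texts with a script before loading them, the following is a minimal Python sketch of the same idea (swap curly quotes for straight ones and save the result as Unicode). It is not how Text Converter works internally; the function names are invented for the example, and the `cp1252` source encoding is an assumption that suits many older Windows files but may not suit yours.

```python
# Hypothetical helpers for illustration; not part of WordSmith Tools.
SMART_TO_PLAIN = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def straighten_quotes(text):
    """Replace curly quotes and apostrophes with their straight equivalents."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))


def convert_file(source_path, target_path, source_encoding="cp1252"):
    """Read a text in the assumed source encoding and save a straightened UTF-16 copy."""
    with open(source_path, encoding=source_encoding) as source:
        text = source.read()
    with open(target_path, "w", encoding="utf-16") as target:
        target.write(straighten_quotes(text))


print(straighten_quotes("\u201cDon\u2019t panic\u201d"))
```

Writing to a new file name, as `convert_file` does, keeps the original text intact in case the guessed source encoding turns out to be wrong.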
</p> <p><span class="badge badge-info text-white mr-2">473</span>
12.11 pineapple-slicing
won't slice a pineapple
"Propose to any Englishman any principle, or any instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, a defect, or an impossibility in it. If you speak to him of a machine for peeling a potato, he will pronounce it impossible: if you peel a potato with it before his eyes, he will declare it useless, because it will not slice a pineapple." Charles Babbage, 1852.
(Babbage was the father of computing, a 19th Century inventor who designed a mechanical computer, a mass of brass levers and cog-wheels. But in order to make it, he needed much greater accuracy than existing technology provided, and had all sorts of problems, technical and financial. He solved most of the former but not the latter, and died before he was able to see his Difference Engine working. The proof that his design was correct was shown later, when working versions were made. The difficulties he encountered in getting support from his government weren't exclusively English.)

12.12 printer didn't print
printing problem
If your printing comes out with one or more columns printed OK but others blank, you may have pulled your columns too wide for the paper. WordSmith uses information about your printer's defaults to compute what will and will not fit on the current paper. If you can change the printer settings to landscape that will give more space.

12.13 too slow
It takes ages
If you're processing a lot of text and you have an ancient PC with little memory and a hard disk that Noah bought from a man in the market for a rainy day, it might take ages. You'll hear a lot of clicks coming from the hard disk when memory is low.
Solution: get a faster computer, by installing more memory (which makes a big difference), by defragmenting your hard drive, by using a disk cache, or by adjusting virtual memory settings. If you're running WordSmith Tools on a network, check with the network administrator whether performance is significantly degraded because of network access.
Solution 2: quit all programs you don't need. That can restore a lot of system memory.
Solution 3: quit Windows and start again. That can restore a lot of system memory.
Solution 4: save and read from the local hard disk (C: or D:), not the network.

12.14 won't start
it doesn't even start
Yikes!
</p> <p><span class="badge badge-info text-white mr-2">474</span>
12.15 word list out of order
word-list out of order
Words are sorted according to Microsoft routines which depend on the language. If you process Spanish but leave the Language settings to "English", you will get results which are not in correct Spanish order (e.g. LL will come just before LM).
Solution: choose your language and re-compute the word-list.
</p> <p><span class="badge badge-info text-white mr-2">475</span> Error Messages Section XIII</p> <p><span class="badge badge-info text-white mr-2">476</span>
13 Error Messages

13.1 list of error messages
List of Error Messages
See also:
Troubleshooting.
Can only save WORDS as ASCII
Can't call other Tool
Can't make folder as that's an existing filename
Can't merge list
Can't read file
Character set reset to <x> to suit <language>
Concordance file is faulty
Concordance stop list file not found
Conversion file not found
Destination folder not found
Disk problem: File not saved
Dispersions go with concordances
Drive not valid
Failed to access Internet
Failed to create new folder name
File access denied
File contains none of the tags specified
File not found
Filenames must differ!
Full drive:\folder name needed
function not working properly yet
INI file not found
Invalid Concordance file
Invalid file name
Invalid Keywords Database file
Invalid Keywords file
Invalid Wordlist Comparison file
Invalid Wordlist file
Joining limit reached: join & try again
Key words file is faulty
Keywords Database file is faulty
Limit of 500 file-based search-words reached
Links between Tools disrupted
Match list details not specified
Must be a number
Network registration running elsewhere or vice-versa
No access to text file: in use elsewhere?
No associates found
No clumps identified
No clusters found
No collocates found
No concordance entries found
No concordance stop list words
No deleted lines to Zap
No entries in Keywords Database
No Key Words found
</p> <p><span class="badge badge-info text-white mr-2">477</span>
No key words to plot
No keyword stop list words
No lemma list words
No match list words
No room for computed variable
No statistics available
No stop list words
No such file(s) found
No tag list words
Not a valid number
No wordlists selected
Only X% of reference corpus words found
Original text file needed but not found
Registration string is not correct
Registration string must be 20 letters long
Short of Memory!
Source Folder file(s) not found
Stop list file not found
Stop list file not read
Tag file not found
Tag list file not read
This function is not yet ready!
This is a demo version
This program needs Windows 95 or greater
To stop getting this annoying message, Update from Demo in setup.exe
Too many ignores (50 limit)
Too many sentences (8000 limit)
Two files needed
Truncating at xx words -- tag list file has more!
Unable to merge Keywords Databases
Why did my search fail?
Word list file not found
Wordlist comparison file is faulty
Word-list file is faulty
WordSmith Tools has expired: get another
WordSmith Tools already running
WordSmith version mis-match
xx days left

13.2 .ini file not found
.ini file not found
On starting up, WordSmith looks for the wordsmith6.ini file which holds your current defaults. If you've removed or renamed it, restore it. This file should be in a sub-folder of your Documents folder called \wsmith6.

13.3 administrator rights
administrator rights
If you see this error message it's because you need Administrator rights to register WordSmith. Try searching for "Run as Administrator".
</p> <p><span class="badge badge-info text-white mr-2">478</span>
13.4 base list error
base list error
WordSmith is trying to access a word or concordance line above or below the top or bottom of the data computed. This is a bug.

13.5 can only save words as ASCII
Can only save WORDS as Plain Text
WordSmith Tools can't save graphics as a text file. If you get this error message, you can only save this type of data by copying to the clipboard and pasting it into your word-processor.

13.6 can't call other tool
Can't call other Tool
Inter-Tool communication has got disrupted. Save your work, first. Then, if necessary, close down WordSmith Tools altogether, then start the main wordsmith6.exe program again.

13.7 can't make folder as that's an existing filename
Can't make folder as that's an existing filename
If you already have a file called C:\TEMP\FRED, you can't make a sub-folder of C:\TEMP called FRED. Choose a new name.

13.8 can't compute key words as languages differ
Can't compute key words as languages differ
Key words can only be computed if both the text file and the reference corpus are in the same primary language. You can compute KWs using 2 different varieties of English or 2 different varieties of Spanish, but not between English and French.

13.9 can't merge list with itself!
Can't merge list with itself
You can only merge 1 word list or key word database with 1 other at a time. Select (by clicking while holding down the Control key) 2 file-names in the list of files.

13.10 can't read file
Can't read file
If this happens when starting up WordSmith Tools, there is probably a component file missing. One example is sayings.txt, which holds sayings that appear in the main Controller window. If you've deleted it, I suggest you use notepad to start a new sayings.txt and put one blank line in it.
If you get this message at another time, something has gone wrong with a disk reading operation.
The file you're trying to read in may be corrupted. This happens easily if you often handle very large files. See your Windows manual for help on fragmentation.
</p> <p><span class="badge badge-info text-white mr-2">479</span>
13.11 character set reset to <x> to suit <language>
Character set reset to <x> to suit <language>
Prior to version 2.00.07, WordSmith Tools handled fewer character sets and languages than it does now. Accordingly, data saved in the format used before that version may not "know" what language it was based on. If you get this message when opening up an old WordSmith data file, it's because WordSmith doesn't know what language it derived from. Through gross linguistic imperialism, it will by default assume that the language is English!
If the data are okay, just click the save button so that next time it will "know" which language it's based on. If not, reset the language to the one you want in the Controller, Language Settings | Text, then re-save the list.

13.12 concordance file is faulty
Concordance file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .CNC, .LST) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of Concord.

13.13 concordance stop list file not found
Concordance stop list file not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.

13.14 confirmation messages: okay to re-read
Okay to re-read?
A confirmation message. To proceed, Viewer & Aligner will now re-read the disk file. This will affect any alterations you've already made to the display.
You may wish to save first and then try again later. Also, Viewer & Aligner will try to read the whole text file. If you have a very big file on a slow CD-ROM drive, this will take some time.

13.15 conversion file not found
Conversion file not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
</p> <p><span class="badge badge-info text-white mr-2">480</span>
13.16 destination folder not found
Destination folder not found
WordSmith couldn't find that folder; perhaps it's mis-spelt.

13.17 disk problem -- file not saved
Disk problem: File not saved
Something has gone wrong with a disk writing operation. Perhaps there's not enough room on the drive. If so, delete some files on that drive.

13.18 dispersions go with concordances
Dispersions go with concordances
They can't be saved separately.

13.19 drive not valid
Drive not valid
WordSmith is unable to access this drive. This could happen if you attempt to access a disk drive which doesn't exist, e.g. drive P: where your drives include A:, C:, D: and E:.

13.20 failed to access Internet
Failed to access Internet
This function relies on a) your having an Internet browser on your computer, b) your system "associating" an Internet URL ending .htm with that browser.

13.21 failed to create new folder name
Failed to create new folder or file-name
A folder and a file cannot have the same name. If you already have a file called C:\TEMP\FRED, you can't make a sub-folder of C:\TEMP called FRED. Choose a new name.
Or you don't have rights to create files in that folder.
Or something went wrong while WordSmith was trying to write a file, for example the disk was full up.
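The three causes listed above can be told apart programmatically. The sketch below is a hedged Python illustration of the general operating-system behaviour, not WordSmith's own code; `try_make_folder` is a name invented for the example, and the FRED file is created in a temporary folder rather than in C:\TEMP.

```python
import os
import tempfile


def try_make_folder(path):
    """Try to create a folder and report why it failed instead of raising."""
    try:
        os.mkdir(path)
        return "created"
    except FileExistsError:
        # Raised when a file OR a folder of that name is already there.
        kind = "file" if os.path.isfile(path) else "folder"
        return f"failed: a {kind} with that name already exists"
    except PermissionError:
        return "failed: no rights to create folders here"
    except OSError as error:
        # Anything else, e.g. a full disk or a missing parent folder.
        return f"failed: {error.strerror}"


with tempfile.TemporaryDirectory() as temp:
    fred = os.path.join(temp, "FRED")
    with open(fred, "w") as handle:  # first create a *file* called FRED
        handle.write("some text")
    print(try_make_folder(fred))  # a folder cannot share the file's name
```

Catching the specific exceptions before the general `OSError` is what lets the message say which of the causes applied.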

13.22 failed to read file
Failed to Read
This may have happened a) because you included a text file which happens to be empty (zero size), or b) because your disk filing system has got screwed up, which is especially likely to occur if you often use large files in your word processor (in which case do a disk cleanup) or c) because you tried to use the wrong kind of file for the job (for example the KeyWords procedure won't work if you choose text files as your word-lists).
</p> <p><span class="badge badge-info text-white mr-2">481</span>
13.23 failed to save file
Failed to Save
Maybe because you had the same file open in another program or another instance of the Tool you're running. If so, close it and try again. Or because the folder you're saving to is a read-only folder on a network, or because the disk is full, or because your disk filing system has got screwed up. This last problem is quite common, actually, and is especially likely to occur if you often use large files in your word processor. In that case run Programs | Accessories | System Tools | Disk Defragmenter.
If you're working on a network, you will be able to save on certain drives and folders but not others; the solution is to try again on a memory stick or a hard disk drive which you do have the right to save to.

13.24 file access denied
File Access Denied
Maybe the file you want is already in use by another program. You'll find most word-processors label any text files open in them as "in use", and won't let other programs access them even just to read them. Close the text file down in your word processor.

13.25 file contains none of the tags specified
File contains none of the tags specified
You specified tags, but none of them were found.

13.26 file has "holes"
File has "holes"
Text files are supposed to contain only characters, punctuation, numbers, etc. without any unrecognised ones such as character(0).
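A file can be checked for such holes outside WordSmith. The Python sketch below counts character(0) bytes and offers one possible repair, replacing each with a space. It assumes a single-byte plain text: a text saved as 16-bit Unicode legitimately contains zero bytes and should not be treated this way. The function names are invented for the example.

```python
def count_holes(data: bytes) -> int:
    """Count character(0) bytes in the raw contents of a single-byte text file."""
    return data.count(b"\x00")


def fill_holes(data: bytes) -> bytes:
    """Return a copy in which every character(0) is replaced by a space."""
    return data.replace(b"\x00", b" ")


sample = b"one\x00two\x00\x00three"  # made-up contents with three holes
print(count_holes(sample))
print(fill_holes(sample))
```

To apply this to a real file, read it in binary mode (`open(path, "rb")`) so that the zero bytes reach the functions unchanged.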
The problem could have arisen because it was transferred from one system to another, part of the disk is corrupted, or else maybe the file contains unrecognised graphics (or else it is not a plain text file but e.g. a Word document).
You can solve this problem by converting the text using the Text Converter. If it is a plain text with holes these will be replaced by spaces. You can find texts with holes using the File Utilities.

13.27 file not found
File not found
This message, like Original Text not found, can appear when WordSmith needs to access the original source text used when a list was created, but cannot find it. Have you deleted or moved it? If the file is still available, you may be able to edit the file names in the file name window of this list.
Or the message may come after you've supplied the file name yourself. You may have mis-typed it. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
</p> <p><span class="badge badge-info text-white mr-2">482</span>
13.28 filenames must differ!
Filenames must differ
You can't compare a file with itself.

13.29 folder is read-only
folder is read-only
For some purposes, WordSmith needs to save files e.g. lists of results you have made so that you can get at recent files again. To do this it needs a place where your network or operating system lets you save. Usually \wsmith6 is fine, but in some institutional settings the drive or folder may be "read-only". If you see this message, choose Folder Settings and select there a folder where you can write as well as read.

13.30 for use on X machine only
For use on pc named XXX only
The software was registered for use on another PC. If you get this message, please re-install as appropriate.

13.31 form incomplete
Form incomplete
You tried to close a form where one or more of the blanks needed to be filled in before WordSmith could proceed.

13.32 full drive & folder name needed
Full drive:\folder name needed
When typing in a file name, remember to include the full drive and folder as well as the file name itself.

13.33 function not working properly yet
function not working properly yet
This is a function under development, still not fully implemented.

13.34 invalid concordance file
Invalid Concordance file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .CNC, .LST) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of Concord.
</p> <p><span class="badge badge-info text-white mr-2">483</span>
13.35 invalid file name
Invalid file name
File names may not contain spaces or certain symbols such as ? and *.

13.36 invalid KeyWords database file
Invalid Keywords Database file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KWS, .KDB) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KDB file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced for a database by the current version of KeyWords.

13.37 invalid KeyWords calculation
Invalid Keywords calculation
For KeyWords to calculate the key-words in a text file by comparing it with a reference corpus, both must be in the same language, both must be sorted in the same way (alphabetical order, ascending) and they should both be in the same format (Unicode or single-byte). If you see this message you are trying to compute KWs without meeting these criteria.
Solution: open each word-list and check to see it is OK and that it is sorted alphabetically in the same way (in the Alphabetical view, click the top bar to re-sort in ascending alphabetical order), then save it. Check they have both been made with the same language and format settings and if necessary re-compute one or both of them.

13.38 invalid WordList comparison file
Invalid Wordlist Comparison file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .CNC) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced as a comparison file by WordList.

13.39 invalid WordList file
Invalid Wordlist file
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .CNC) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .LST file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of WordList.

13.40 joining limit reached
Joining limit reached: join & try again
Only a certain number of words can be lemmatised in one operation. If you reach the limit and get this message,
1. lemmatise by pressing F4,
13.41 KeyWords database file is faulty
Keywords Database file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KDB, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KDB file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced for a database of keywords, by the current version of KeyWords.
13.42 KeyWords file is faulty
Key words file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .KWS, .KDB) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .KWS file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of KeyWords.
13.43 limit of file-based search-words reached
Limit of search-words reached
No more than 15 search-words can be processed at once, unless you use a file of search words (page 161) to tell Concord to do them in a batch, where the limit is 500.
13.44 links between Tools disrupted
Links between Tools disrupted
WordSmith Tools Controller (page 4) or an individual Tool has tried to call another Tool and failed. There may have been a fault in another program you're running or a shortage of memory. As inter-tool communication links (page 438) are vital in this suite, you should exit WordSmith and re-enter.
13.45 match list details not specified
Match list details not specified
You pressed the Match List (page 92) button but then failed to choose a valid match list file or else to type in a template for filtering. Try again.
13.46 must be a number
Must be a number
You typed in something other than a number. Be especially careful with lower-case L and 1, and O (the letter) instead of 0 (the number).
December, 2015. 
Page 469</p> <p><span class="badge badge-info text-white mr-2">485</span> 470 Error Messages
13.47 mutual information incompatible
Mutual information list is incompatible
A mutual information list derives from an index file, and knows which index file it derives from when computed. Normally when it opens up, it opens up the corresponding index file too. If that index file is not found on your PC or has been renamed, you will see this message. The mutual information can still be accessed but a) what you see in terms of Frequency and Alphabetical lists refers to a different index file, and b) it will not be possible to get concordances directly from the listing.
13.48 network registration used elsewhere
Network registration running elsewhere or vice-versa
The site licence registration for use on a network is not valid for use on a stand-alone PC, and vice-versa. If you get this message, please re-install as appropriate.
13.49 no access to text file - in use elsewhere?
No access to text file: in use elsewhere?
The file cannot be accessed. Perhaps another application is using it. If so, close down the file in that other application and try again.
13.50 no associates found
No associates found
Alter settings (Settings | Min & Max Frequencies) and try again.
13.51 no clumps identified
No clumps identified
Alter settings and try again.
13.52 no clusters found
No clusters found
Alter the settings (Settings | Clusters) and try again. There were too few concordance lines to find the minimum number needed, or the cluster length was too great.
13.53 no collocates found
No collocates found
In the Controller (page 4), alter the settings (Concord settings | Min. Frequency) and try again. There were too few concordance lines to find the minimum number needed.
December, 2015. 
Page 470</p> <p><span class="badge badge-info text-white mr-2">486</span> 471 WordSmith Tools Manual
13.54 no concordance entries
No concordance entries found
If you got no concordance entries, either a) there really aren't any in your text(s), b) there's a problem with the specification of what you're seeking, or c) there's a problem with the text selection. Check how you've spelt the search-word and context word. If you're using accented text (page 419), check the format of your texts. If you're using a search-word file (page 161), ensure this was prepared using a plain Windows word-processor such as Notepad.
Have you specified any wildcards (page 159) (* and ?) accurately? If you are looking for a question-mark, you may have put "?" correctly but remember that question-marks usually come at the ends of words, so you will need *"?".
Tip
Bung in an asterisk (page 159) or two. You're more likely to find book* than book.
13.55 no concordance stop list words
No concordance stop list words
13.56 no deleted lines to zap
No deleted lines to Zap
You pressed Ctrl+Z but hadn't any deleted lines to zap (page 129). No harm done.
13.57 no entries in KeyWords database
No entries in Keywords Database
Alter settings and try again.
13.58 no fonts available
no fonts available for language
The operating system does not have a font which can show the characters for that language. You need to find and install a font.
13.59 no key words found
No Key Words found
Alter settings and try again. The minimum frequency is set too high and/or the p value (page 235) too small for any key words to be detected. For very short texts a minimum frequency of 2 may be needed.
December, 2015. Page 471</p> <p><span class="badge badge-info text-white mr-2">487</span> 472 Error Messages
13.60 no key words to plot
No key words to plot
Had you deleted them all?
13.61 no KeyWords stop list words
No keyword stop list words
WordSmith either failed to read your stop-list file or it was empty.
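The tip under 13.54 (book* finds more than book) can be illustrated with a small Python sketch. It is not how Concord is implemented; it simply assumes the common convention that * stands for any run of characters within a word and ? for exactly one character, and the function names `wildcard_to_regex` and `find_matches` are made up for this example. Searching for a literal question-mark (the quoted "?" form mentioned above) is not handled here.

```python
import re


def wildcard_to_regex(search_word: str) -> re.Pattern:
    """Translate a search-word with * and ? into a whole-word regular expression."""
    parts = []
    for ch in search_word:
        if ch == "*":
            parts.append(r"\w*")        # any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")         # exactly one word character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts), re.IGNORECASE)


def find_matches(search_word: str, text: str) -> list:
    """Return the words in text that match the search-word as whole words."""
    pattern = wildcard_to_regex(search_word)
    return [word for word in re.findall(r"\w+", text) if pattern.fullmatch(word)]


text = "A book about books and bookshops, booked early."
print(find_matches("book", text))   # ['book']
print(find_matches("book*", text))  # ['book', 'books', 'bookshops', 'booked']
```

Without the asterisk only the exact word form is found, which is why a search that returns nothing is often worth repeating with a wildcard added.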
13.62 no lemma list words
No lemma match list words
WordSmith either failed to read your lemma list file or it was empty.
13.63 no match list words
No match list words
WordSmith either failed to read your match list (page 92) file, or it was empty, or you forgot to check the action to be taken (one option is None). Or you tried to match up using a list of words, or a template, when the current column has only numbers. Or else there really aren't any like those you specified!
13.64 no room for computed variable
No room for computed variable
There isn't enough space for the variable you're trying to compute.
13.65 no statistics available
No statistics available
Some types of word list created by WordSmith Tools, e.g. a word list of a key words database, have words in alphabetical and frequency order but no statistics on the original text files. You cannot therefore call the statistics up in WordList. You might also see this message if the statistics file you're trying to call up is corrupted.
13.66 no stop list words
No stop list words
WordSmith either failed to read your stop-list file or it was empty.
13.67 no such file(s) found
No such file(s) found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
December, 2015. Page 472</p> <p><span class="badge badge-info text-white mr-2">488</span> 473 WordSmith Tools Manual
13.68 no tag list words
No tag list words
WordSmith either failed to read your tag file or it was empty.
13.69 no word lists selected
No word lists selected
For WordSmith to know which word lists to compare, you need to select them, by clicking on one in each folder. If you've changed your mind, press Cancel.
13.70 not a valid number
Not a valid number
Either you've just typed in, or else WordSmith Tools has just attempted to read (e.g. from the wordsmith6.ini defaults (page 113) file), something which is expected to be a number but wasn't. Computers will not see the capital O as equivalent to the number 0. Or else there is a number but accompanied by some other letters or symbols, e.g. £30. If this happens when WordSmith is starting up, check out the wordsmith6.ini file for mistakes.
13.71 not a WordSmith file
The file you are trying to open is not a WordSmith Tools file. WordSmith makes files containing your results, files whose names end in .LST, .CNC, .KWS, etc. These are in WordSmith's own format and cannot be opened up by Microsoft Word; likewise a plain text file or a Word .doc (page 444) cannot usually be read in by WordSmith as a data file, but only as a text file for processing.
See also: Converting Data from Previous Versions (page 323)
13.72 not a current WordSmith file
Not a Current WordSmith File
The file you are trying to open was made using WordSmith but either
· it's a file made using version 1-3 or
· it's a file made with the beta version of WordSmith and the format has had to change (sorry!)
If the former, you may be able to convert it using the Converter (page 323).
13.73 nothing activated
Nothing activated
Some forms have choices labelled "Activated" which you can switch on and off. If they are un-checked, you can still see what they would be but WordSmith will ignore them.
December, 2015. Page 473</p> <p><span class="badge badge-info text-white mr-2">489</span> 474 Error Messages
13.74 Only X% of words found in reference corpus
Only X% of words found in reference corpus
When WordSmith computes key words it checks to see that most of the words in your small word-list are found in the reference corpus, as would be expected. If fewer than 50% are found, you will get this warning. That is a bit unusual, and is supplied as a warning that, for example, there might be something strange about one of your two texts. If you know there is nothing strange, then you could ignore the message.
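As a rough illustration of the 50% check just described, here is a minimal Python sketch. It is an assumption-laden example and not WordSmith's actual routine: the function names are invented, the comparison is made on distinct word forms ignoring case, and whether WordSmith counts types or running words is not stated in this manual.

```python
from typing import Iterable, Optional


def reference_coverage(study_words: Iterable[str],
                       reference_words: Iterable[str]) -> float:
    """Percentage of distinct study-list words also present in the reference list."""
    study = {w.casefold() for w in study_words}
    if not study:
        raise ValueError("study word list is empty")
    reference = {w.casefold() for w in reference_words}
    return 100.0 * len(study & reference) / len(study)


def coverage_warning(study_words: Iterable[str],
                     reference_words: Iterable[str],
                     threshold: float = 50.0) -> Optional[str]:
    """Return a warning string if coverage falls below the threshold, else None."""
    pct = reference_coverage(study_words, reference_words)
    if pct < threshold:
        return f"Only {pct:.0f}% of words found in reference corpus"
    return None
```

With a study list of `["aa", "bb", "cc", "the"]` and a reference list containing only `"the"` among them, a quarter of the study words are covered, so the sketch returns the warning text.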
If you are processing clusters you are much more likely to see this warning, however, as the chance of 3-word strings matching in the two lists is less than that of single words matching. It is up to you to decide whether there is some error in what you are doing or it is OK for many of your smaller word list's words/clusters not to be found in the reference corpus word list.
It might not be so unusual if your reference corpus was very small. But if it is indeed very small, the whole procedure is not very reliable. WordSmith simply looks at the frequencies of each word form and uses basic statistics to compute how greatly they differ in frequency. Basic statistics rely on a notion of what can be expected. If the reference corpus is incredibly small, WordSmith's computation of what is to be expected isn't really very reliable. As a dumb example, if you met three citizens of a country you have never visited, and all looked fat, you might suppose the people of that country to be fat in general, but the sample size is not reliable for such an expectation.
The KW procedure isn't really proof of anything, incidentally. Words don't occur in texts at all randomly and all ordinary basic statistics can do, in my opinion, is give us food for thought. So a KW listing isn't proof of anything but it may well give good ideas as to what may prove interesting avenues for research.
13.75 original text file needed but not found
Original text file(s) needed but not found
To proceed, WordSmith needed to find the original text file (page 113) which the list was based on. But it has been moved or renamed. Or if on a network, your network connection is not mapped, or the network is down ...or else the right disk or CD-ROM is not in the drive!
13.76 printer needed
WordSmith needs a printer driver to be installed, even if you never actually print anything. You don't need to buy a printer or to switch a printer on, but the Print Preview (page 97) function in Concord, WordList, KeyWords etc.
does need to know what sort of paper size you would print to. If you get a message complaining that no printer has been installed, choose Start | Settings | Printers & Faxes and install a default printer (any printer will do) in Windows.
13.77 registration code in wrong format
Registration code unexpectedly short
PASTE the registration supplied into the box; only paste into the Name or Other Details boxes the details supplied.
If you see this message on registering you may have a registration for a previous major version. If so, contact sales at lexically dot net with your original purchase details and you will be entitled to a
December, 2015. Page 474</p> <p><span class="badge badge-info text-white mr-2">490</span> 475 WordSmith Tools Manual
50% discount on the current version.
13.78 registration is not correct
Registration is not correct
It doesn't match up with what's required for a full updated version! The old registration code in earlier versions is no longer in use. WordSmith will still run but in Demonstration Version (page 427) mode.
13.79 short of memory
Short of Memory!
An operation could not be completed because of shortage of RAM (page 447).
13.80 source folder file(s) not found
Source Folder file(s) not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the filename itself.
13.81 stop list file not found
Stop list file not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
13.82 stop list file not read
Stop list file not read
Something has gone wrong with a disk reading operation. The file you're trying to read in may be corrupted. This happens easily if you often handle very large files, especially if it's a long time since you last ran Scandisk to check whether any clusters in your files have got lost. See your DOS or Windows manual for help on fragmentation.
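Several of the messages above (for instance 13.80 and 13.81) come down to a file name typed without its full drive and folder. The following Python sketch shows one way such a name could be checked before use. It is an illustration only, using the standard `pathlib` module rather than anything inside WordSmith, and the function name `has_full_drive_and_folder` and the sample paths are hypothetical.

```python
from pathlib import PureWindowsPath


def has_full_drive_and_folder(file_name: str) -> bool:
    """True if the name includes a drive, a rooted folder path and a file name."""
    path = PureWindowsPath(file_name)
    # drive is e.g. "C:", root is the backslash after it, name is the last part
    return bool(path.drive) and bool(path.root) and path.name != ""


print(has_full_drive_and_folder(r"C:\texts\stoplist.txt"))  # True
print(has_full_drive_and_folder("stoplist.txt"))            # False
print(has_full_drive_and_folder(r"C:stoplist.txt"))         # False
```

`PureWindowsPath` only parses the text of the name, so the check works on any operating system and does not tell you whether the file actually exists.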
13.83 tag file not found
Tag File not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
13.84 tag file not read
Tag list file not read
Something has gone wrong with a disk reading operation. The file you're trying to read in may be corrupted. This happens easily if you often handle very large files. See your Windows manual for help on fragmentation.
December, 2015. Page 475</p> <p><span class="badge badge-info text-white mr-2">491</span> 476 Error Messages
13.85 this function is not yet ready
This function is not yet ready!
Temporary message, for functions which are still being tested.
13.86 this is a demo version
This is a demo version
You will probably want to upgrade (page 427) to the full version.
13.87 this program needs Windows XP or greater
This program needs Windows XP or better
From version 4.0, this program has required operating systems for this millennium.
13.88 to stop getting this message ...
Get an update. This is "annoyware" for the demonstration version (page 427).
13.89 too many requests to ignore matching clumps
The limit is 50. Do any remaining joining manually.
13.90 too many sentences
The limit is 8,000. Do the task in pieces.
13.91 truncating at xx words -- tag list file has more
The tag list file has more entries than the current limit. Or else it isn't a tag list file at all!
13.92 two files needed
You need to select 2 files for this procedure. Select (by clicking while holding down the Control key) 2 file-names in the list of files.
13.93 unable to merge Keywords Databases
Perhaps there wasn't enough RAM (page 447) to carry out the merge.
13.94 why did my search fail?
The standard search function (F12) for a list of data operates on the currently highlighted column. If you want to search within data from another column, click in that column first. By default, a search is "whole word". 
Use * at either end of the word or number you're searching for if you want to find it, e.g. in any data consisting of more than one word. (The advantage of the asterisk system is that it allows you to specify either a prefix or a suffix or both, unlike the standard
December, 2015. Page 476</p> <p><span class="badge badge-info text-white mr-2">492</span> 477 WordSmith Tools Manual
Windows search "whole word" option.)
13.95 word list file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced by the current version of WordList.
13.96 word list file not found
You typed in the name of a non-existent file. If typing in a file name, remember to include the full drive and folder as well as the file name itself.
13.97 WordList comparison file is faulty
Each type of file created by WordSmith Tools has its own default filename extension (e.g. .LST, .KWS) and its own internal structure. If you have another file with the same extension produced by another program, this will not be compatible. It would not be sensible to rename a .CNC file to .TXT, or vice-versa! WordSmith has detected that the file you're calling up wasn't produced as a comparison file by WordList.
13.98 WordSmith Tools already running
Don't try to start WordSmith Tools again if it's already running. Just Alt-tab back to the instance which is running. (You can, however, have several copies of each tool running at once.)
13.99 WordSmith Tools expired
Message for limited period users only. Your version of WordSmith Tools has passed its validity and is now in demo mode (page 427). 
Download another from the Internet (page 425).
13.100 WordSmith version mis-match
Since the various Tools are linked (page 438) to each other, it is important to ensure that the component files are compatible with each other. If you get this message it is because one or more components is dated differently from the others.
Solution: download those you need from one of the contact websites (page 425).
13.101 XX days left
Message for limited period users only. At the end of this time WordSmith will revert to demo mode (page 427).
December, 2015. Page 477</p> <p><span class="badge badge-info text-white mr-2">493</span> Index 478 acknowledgements 416 add value to corpus 149 Index adding notes to data 29 adjust settings 29 Administrator rights 462 - # - Adobe .pdf to plain text 371 advanced concordance settings 163 # in clusters 278 advanced scripting 106 # symbol 318 advanced settings convert from UTF-8 31 - . - customising menus 31 deadkeys 31 force keyboard 31 .DOC files convert to .TXT in MS Word 444 Maori 31 convert to plain text using Text Converter 371 menu shortcuts 31 popup menu 31 .PDF to plain text 371 advanced settings button 80 .XLS to txt 371 alt-tab 127 .zip files 50 annotate source texts 149 - { - API 416 altering your data 67 custom .dll file 67 {CHR( conversion 362 lemmatising with custom .dll 67 apostrophe 125 - ~ - apostrophes -- curling or straight 367 Apple Mac 440 ~ operator 202 Application programming interface 416 - 2 - associate defined 243 associate word-lists and concordances with file-types 428 25 lines 456 associated entries lemmas 270 - 3 - associates 241 attach date to text file 48 32-bit version 452 auto sentence handling 146 auto-joining lemmas 273 - 5 - automated file-based concordancing 161 automated processing 106 500 key words 254 - B - - A - Babbage 458 about option 437 batch choosing 232 accents 420 batch concordancing 165 accents & symbols 420 batch processing accessing previous results 97 and Excel 39 accurate sort in WordList 315 © 2015 Mike Scott</p> 
<p><span class="badge badge-info text-white mr-2">494</span> 479 WordSmith Tools Manual settings 413 bibliography 417 Charles Babbage 458 Big5 372 check current version 24 blanking out entries 168 checking for updates 80 BNC handling of sentences and headings 146 Chinese Big5 372 selecting between texts 137 Chinese GB2312 372 selecting within texts 138 chi-square 245 tag file 141 chm files not visible 23 text format 435 Choose Languages BNC Sampler version 427 overview 7 boolean and/not 202 choosing files from standard dialogue box 51 bugs 417 choosing reference corpus 237 burstiness 446 choosing texts 42 .DOC files 42 - C - .DOCX files 42 .PDF files 42 .TXT files 42 calculating a plot 251 clear previous selection 44 call a concordance 234 Dickens text 44 calling other tools 438 store text files 44 cannot compare word-lists in different languages 468 classroom use class instructions 51 can't see Concord tags 455 setting up a training sesssion 51 CD-ROM university or school work 51 speed 448 storage 447 clipboard 422 clipboard advanced settings 36 change language of existing data 419 clumps changing font 78 regrouping 244 changing from edit to type-in mode 428 cluster Character Profiler definition 425 how to profile text 405 overview 6 cluster settings 310 purpose 405 clusters 448 settings 408 joining 283 reduction & merging 283 character sets 419 clusters in KeyWords 246 characters and letters accents window 28 Cocoa tags 368 wildcards 28 codepages 419 characters for different languages 420 codes 419 characters in save as text 222 codes in search-word 159 characters within word 125 collocate follow 184 chargram collocate minimum frequency & length 226 definition 425 collocates 179 display 181 chargram procedure 409 follow 184 Chargrams display 410 highlighting in concordance 186 horizons 180 overview 7 lemmas 183 purpose 409 © 2015 Mike Scott</p> <p><span class="badge badge-info text-white mr-2">495</span> Index 480 printing 211 collocates 179 purpose 158 minimum 
frequency 180 raw numbers 220 relationships 180 research uses 158 separated by search-word 222 saving 211 sorting 189 sorting 208 word clouds 189 sound and video 212 collocation 187 source text file 165 patterns 207 starting tips 14 settings 187 stretching the display to see more 165 specifications 187 student use 158 collocation associates 241 summary statistics 215 collocation breaks 188 teaching uses 158 colour categories 51 text segments 218 concordances 171 uniform plot 191 colouring specific characters 340 viewing options 220 colours what you see and can do 165 changing colours 60 wildcards 159 reset colour choices 60 zapping unwanted lines 206 column headings 71 concordance 159, 165 column marked green 111 advice 165 column tagged conversion 369 browsing original 165 column tagged mark-up 369 display 165 column totals 62 grow, shrink 165 columns in printing 80 highlighting collocates 186 comparing wordlists 260 padding 165 compute concordance from collocate 184 purple marks 165 compute keywords from a word list 259 settings 163 compute new column of data 63 concordance batch processing 163 concgrams 393 concordance characters lining up 422 filtering 402 concordances and colour categories 171 generating 395 concordancing Concord 165 multimedia 212 blanking 168 tags 198 breakdowns 215 Concord's save as characters 225 categories 168 consequence v. 
consequences 215 clusters 175 consistency analysis (detailed) 263 collocation 179 consistency analysis (simple) 262 creating exercises 168 consistency lists dispersion 191 sorting 268 hiding tags 220 contact addresses 425 index 158 context horizons 164 multiple search-words 161 context word 164, 202 nearest tag 199 contextual frequency sort 208 overview 5 context-word marking in text file 211 path visibility 220 controller (wordsmith exe file) 4 patterns 207 plot 191 Controller explained 27 © 2015 Mike Scott</p> <p><span class="badge badge-info text-white mr-2">496</span> WordSmith Tools Manual 481 definition of key key-word 240 convert data from old version 323 definition of sentence 426 convert within text files 362 definitions converter 354 chargrams 426 converting BNC XML version 374 clusters 426 converting Treetagger text 369 concgram 394 copy headings 426 all 66 paragraphs 426 choices 66 texts 426 selective 66 valid characters 425 specify 66 words 425 copy data to Word 422 deleting entries 129 Corpus Corruption Detector demo limit 456 aim 329 overview 7 demonstration version 427 process 329 detailed consistency 263 dice coefficient 268 corpus paragraph-count 298 relation statistics 268 corpus sentence-count 298 details of MSWord text 340 corpus word-count 298 dice coefficient correcting filenames 110 formula 433 count data frequencies 66 dice coefficient for detailed consistency 268 crash 417 Dickens text 446 cumulative scores 63 directories 431 curly quotes 367 dispersion 191 custom column headings 71 dispersion plot custom layouts sorting 210 removing 115 DOS to Windows 372 custom processing 67 download new version 24 cut spaces 222 downloaded text problems 354 cutting line starts 138 drag and drop 427 drop a text file onto WordSmith 427 - D - duplicate concordance lines 207 duplicate text files 348 data as text file 249 database construction 238 - E - database statistics 237 date format 425 edit mode 428 dates of texts 48 editing default colours 60 column 
headings 71 defaults 114 concordances 206 .ini files 113 delete if 71 network 113 delete to end 71 defining multimedia tags 147 random deletion of entries 70 definition reduce data to N entries 70 tokens 426 restore to end 71 types 426 reverse deletion 71 definition of associate 243 WordList entries 72 © 2015 Mike Scott</p> <p><span class="badge badge-info text-white mr-2">497</span> Index 482 links between Tools disrupted 469 encrypt your source texts 373 list 461 Entitities to characters 368 match list 469 Error messages must be a number 469 .ini file not found 462 mutual information 470 Administrator rights 462 network registration used elsewhere 470 base list error 463 no access to text file - in use elsewhere? 470 can only save words as ASCII 463 no associates found 470 cannot compute KWs 463 no clumps identified 470 can't call other tool 463 no clusters found 470 can't make folder as that's an existing filename no collocates found 470 463 no concordance entries found 471 can't merge list with itself! 
463 no concordance stop list words 471 can't read file 463 no deleted lines to zap 471 character set reset to <x> to suit <language> 464 no entries in KeyWords database 471 concordance file is faulty 464 no fonts available 471 concordance stop list file not found 464 no key words found 471 confirmation messages - okay to re-read 464 no key words to plot 472 conversion file not found 464 no KeyWords stop list words 472 couldn't merge KW databases 476 no lemma list words 472 destination folder not found 465 no match list words 472 disk problem -- file not saved 465 no room for computed variable 472 dispersions go with concordances 465 no statistics available 472 drive not valid 465 no stop list words 472 expiry date 477 no such file(s) found 472 failed to access Internet 465 no tag list words 473 failed to create new folder 465 no word lists selected 473 failed to read file 465 not a current WordSmith file 473 failed to save 466 not a valid number 473 file access denied 466 not a WordSmith file 473 file contains "holes" 466 nothing activated 473 file contains none of the tags specified 466 only x% of words found in reference corpus 474 file not found 466 original text file needed but not found 474 filenames must differ 467 printer needed but not found 474 for use on pc named XXX 467 read-only folder 467 form incomplete 467 registration string is not correct 475 full drive & folder name needed 467 registration string must be 20 letters long 474 function not working properly yet 467 short of memory 475 invalid concordance file 467 source folder file(s) not found 475 invalid file name 468 stop list file not found 475 invalid KeyWords database file 468 stop list file not read 475 invalid KeyWords file 468 tag file not found 475 invalid WordList comparison file 468 tag file not read 475 invalid WordList file 468 the program needs Windows XP or greater 476 joining limit reached 468 this function is not yet ready 476 KeyWords database file is faulty 469 this is a demo 
version 476 KeyWords file is faulty 469 too many requests to ignore matching clumps 476 limit of file-based search-words reached 469 © 2015 Mike Scott</p> <p><span class="badge badge-info text-white mr-2">498</span> 483 WordSmith Tools Manual File Utilities Splitter Error messages bracket first line 343 too many sentences 476 end of text separator 343 truncating at xx words 476 end-of-text symbols 345 two files needed 476 filenames 344 version mis-match 477 index 343 why did search fail? 476 purpose 343 word list file not found 477 symbols 345 word list is faulty 477 WordList comparison file faulty 477 wildcards 345 WordSmith already running 477 File Viewer 340 XX days left 477 overview 8 example of aligning 381 file-based lemmatisation 273 example of key words 235 file-based search-words or phrases 161 Excel filename and path 220 column totals 102 filenames convert to .txt 371 display 113 editing 110 exercises 168 tab 113 exiting 101 export index data 286 file-types 428 export to spreadsheet etc. 
102
external drive folder letters 431
external hard drive 21
extracting from text files 359
- F -
factory defaults 36, 115
factory settings 114
favourite texts
    loading 50
    saving 50
file associations 428
File Utilities
    compare 2 files 347
    dates of texts 352
    dodgy text 354
    editing text file dates 352
    file chunker 348
    find duplicates 348
    find holes in texts 354
    index 342
    moving files to different folders 351
    overview 8
    rename 349
    Splitter 8
File Utilities Joiner
    joining text files 346
find files containing words 269
find files with KWs 75
find which files contain a word or cluster 269
finding a word or part of a word 109
finding by typing 109
finding relevant files 75
finding source texts 430
first use of WordSmith 446
flash drive folders 431
folder letters 431
folder settings 78
folder view 80
folders 431
    force detailed view 80
folders created using text converter 358
follow-up concordancing 171
fonts
    greek 78
    russian 78
force folders to show in detailed view 80
format test for text files 46
formulae
    dice coefficient 433
    Log Likelihood 433
    MI 433
    MI3 433
    t-score 433
</p> <p><span class="badge badge-info text-white mr-2">499</span> Index 484
formulae
    z-score 433
freeze columns 91
frequencies of suffixes 304
full lemma processing 181, 254
- G -
GB2312 372
general settings 80
get favourite text selection 50
getting started 2
getting started with Concord 14
getting started with KeyWords 15
getting started with WordList 17
globality of plot 446
green marking in left column 111
grow a concordance line 166
grow and shrink 165
- H -
handling hyphens 125
handling multiple windows 127
handling of numbers 125
handling Word .doc files 444
hash representing words with numbers 318
header removing 134
headings
    definition 425
    start & end 146
hex 340
hide tags 201, 222
hide words 222
highlighting collocates in concordance 186
history list 97
holes in file 466
horizons 180
hotkey combinations 439
hotkeys
    Ctrl/F2 101
    list 439
    space bar 168
how many words 437
how much text 437
how to build a database 238
HTML 435
HTML headers cutting out 134
HTML/BNC/XML entities to characters 368
hyphen treatment 435
hyphens 125
- I -
idioms 448
ignore punctuation 225
illegible 457
importing text into a word list 307
incompatibility between word lists 244
index lists
    export 286
    making Wordlist Index 276
    uses 276
    viewing 284
index relationships 294
index settings 310
information about WordSmith version 451
installing WordSmith Tools 21
instructions folder 23
interface 436
international versions 436
introduction to WordSmith Tools 2
inverted commas 455
it won't do what I want 455
- J -
Japanese ShiftJis 372
joining clusters 283
joining entries 270
- K -
key key word defined 240
key key-words 237
key word procedure setting 254
key word settings in Controller 254
keyboard 439
keyness
    definition 236
    p value 235
</p> <p><span class="badge badge-info text-white mr-2">500</span> 485 WordSmith Tools Manual
keyness
    thinking about 237
key-ness defined 236
keyness scores 237
keyword database
    related clusters 243
    statistics 237
KeyWords
    advice 244
    associates 243
    calculation 245
    choosing your files 232
    clumps 243
    clusters 246
    compute a word list 249
    database 237, 238
    disambiguation 243
    display 252
    example 235
    failure/problems 244
    finding KWs in other texts 75
    index 229
    key key-words 240
    links 247
    overview 6
    plot 251
    purpose 229
    sorting 252
    starting tips 15
    tips 244
keywords minimal processing 254
KeyWords plot display 249
Korean Hangul 372
KWs in other text files 75
- L -
language
    Baltic 81
    Central European 81
    change language in saved data 419
    Cyrillic 81
    Greek 81
    Portuguese 81
    Russian 81
language settings
    characters within a word 124
    end of heading marker 124
    end of paragraph marker 124
    end of sentence marker 124
    heading marker 124
    headings (specifying) 124
    hyphens 124
    numbers 124
Languages Chooser
    apostrophe 84
    font 86
    language 84
    new language 86
    saving settings 87
    sort order 86
Languages Chooser: overview 83
layout
    alignment 87
    column width 87
    decimal places 87
    editing column headings 87
    grid format 87
    headings 87
    save layout 87
    typeface 87
lemma file 274
lemma list 274
lemma visibility settings 311
lemmas 270
    auto-joining 273
    file 274
    file-based 273
    joining automatically 273
    joining manually 271
    matching in WordList 274
    template 273
    visibility 311
lemmatising source texts 373
lemmatising using a template 273
letter-count 298
licence details 22
limitations 437
links between tools 438
list of menu options 441
localisation 436
log file to trace problems 32
log likelihood 245
</p> <p><span class="badge badge-info text-white mr-2">501</span> Index 486
log likelihood 245
    formula 433
log likelihood computing 294
Log Likelihood score 289
logging 32
lowest possible value for clusters 278
LY endings in a word list 304
- M -
Mac version 440
machine requirements 440
make a word list from keywords data 249
manual for WordSmith Tools 440
manual joining 271
manual lemmatisation 271
mark_up
    custom settings 133
    custom settings for BNC tags 134
    document header removal 134
    entity references 145
    tags as selectors 134
    types of 145
marking 271
marking context-word in txt 211
marking entries
    green margin 112
    unmarking 112
    white margin 112
marking search-word in txt 211
mark-up 131
    autoload tag file 132
    colours 141
    handling tag-types 132
    HTML & SGML tags 132
    making a tag file 141
    multimedia 147
    section tag 141
    selecting between texts 137
    selecting within texts 138
match list
    filtering 92
    mark words in a word list 92
mean and standard deviation 298
memory stick 21, 21
memory usage 447
menu choices 441
merge concordances 262
merge wordlists 262
MI score 289
MI3
    formula 433
MI3 computing 294
MI3 score 289
Microsoft Word 422, 444
Minimal Pairs
    aim 331
    choosing files 333
    output 336
    overview 8
    requirements 332
    rules and settings 337
    running the program 338
minimal processing (Concord) 226
minimal processing (KeyWords) 255
modify source texts 149
MS Word 444
multimedia tags 147
multiple file analysis 237
multiple lists 39
multi-word unit 149
mutual information
    computing 294
    formula 433
mutual information scores 289
mutual information screen 289
- N -
nag message 476
nearest tag 199
negative keyness 236
negative keywords 254
network settings 23
network speed 448
network version 23
networks
    defaults 113
new in version 6 4
new user 446
n-grams 448
n-grams in WordList 278
no hts within X words 222
</p> <p><span class="badge badge-info text-white mr-2">502</span> WordSmith Tools Manual 487
non-characters allowed within a word 125
notes 29
number handling 125
number of concordance entries 222
number sort 208
numbering paragraphs
    Viewer & Aligner 386
numbering sentences
    Viewer & Aligner 386
numbers
    how treated 446
numbers in words: display 318
- O -
obtaining video and sound files 215
omit # in clusters 278
online screenshots 4
only if containing 137
options for defaults 113
ordering details 427
over-writing 355
Oxford University Press 427
- P -
p value 235
padding out the search-word with space 220
paragraph
    start & end 146
paragraph marker 124
paragraph numbers
    Text Converter 362
paragraphs
    definition 425
    specifying 124
pasting
    as graphic or as text 422
    concordance into Word 422
    paste special 422
patterns
    highlighting in concordance 186
pen drive 21, 21
percentages v. raw numbers 221
permanent settings 113
phrase frames 281
phrases 448
plain text 42
plot dispersion calculation 446
plot dispersion value 446
plot with grouped files 195
plots and links 247
plotting key words 251
precomposed characters 36
prefix frequencies 304
previous lists 97
price 427
print preview zoom 97
printer settings 80
printing
    blank print page 97
    footer 97
    header 97
    landscape 97
    portrait 97
process text file if it contains X 137
programming WordSmith 416
prompts to save 36
punctuation breaks 188
purple marks in word list display 318
- Q -
quitting 101
quotation marks 455
- R -
RAM availability 447
randomised concordance entries 222
range 262, 263
raw numbers 220, 221
raw numbers v. percentages 221
re-compute filenames after zapping 129
recompute token count 297, 310
recomputing plot 195
reference corpus 447
registry 428
regrouping clumps 244
relationship between collocate and search-word 180
relationship computing
</p> <p><span class="badge badge-info text-white mr-2">503</span> Index 488
relationship computing
    Log Likelihood 289
    MI statistic 289
    MI3 statistic 289
    T score 289
    Z score 289
relationships computed from an index 294
relationships screen 289
relationships: case sensitivity 181
removable drive 21
remove all mark-up from a corpus 368
remove custom layouts 115
remove duplicates 207
remove line-breaks 365
remove messages 115
remove some XML mark-up 370
rename numerous files 349
re-ordering 129
re-ordering word lists 72
repeated concordance lines 207
replacing 355
report on a crash 417
Requirements 4
re-sorting
    collocates 189
restore all defaults 114
restore factory defaults 36
restore factory settings 114
restore last file 447
restore last work 80
restore settings 115
restricted search 202
ruler 249
running words 425
- S -
save as
    Excel 102
    HTML 102
    text 102
    XML 102
save favourite text file set 50
save prompts 36
saving
    Ctrl/F2 101
    partial 101
saving defaults 113
scripts 106
search & replace 110
search by typing 428
searching 109
searching by typing 109
search-word
    advanced functions 163
    alternative search words 159
    ascii codes for searching 159
    asterisk 159
    boolean or 159
    case sensitivity 159
    CHR 159
    file-based 161
    history list 159
    slash 159
    syntax 159
    text file specifying 161
    whole word 159
search-word marking in txt file 211
section
    start & end 146
selecting multiple entries 111
selecting within texts 138
sentence
    auto handling 146
    start & end 146
sentence breaks 188
sentence lengths exporting 286
sentence marker 124
sentence only 222
sentences
    definition 425
    specifying 124
separate search-words 222
Set column 168
set column colours 51
set textual date 48
settings 113
    advanced 80
    colours 60
    defaults 113
    folders 78
    fonts 78
    general 80
</p> <p><span class="badge badge-info text-white mr-2">504</span> WordSmith Tools Manual 489
settings 113
    language 81, 124
    main 80
    permanent 113
    printer 80
    restoring 115
    show help 80
    statusbar 80
    toolbar 80
ShiftJis 372
shortcuts 439
show help at startup 113
show help file 80
show or hide data below a minimum threshold 87
show or hide tags 201
shrink a concordance line 166
SI numbers 125
single words 448
slow 458
sorting
    Concord 208
    consistency lists 268
    dispersion plot 210
    KeyWords 252
    tags 199
    word list 315
sound & video tagged files 212
sound file tags 147
source text(s)
    view 116
source texts 430
    converting to a better format 365
    modify 149
speed 448
SRT files
    conversion 374
    obtaining 215
    transcripts 215
standardised or mean type/token ratio 303
start and end of sentence 146
statistics
    headings 298
    letters 298
    paragraphs 298
    sections 298
    sentences 298
    word list statistics 298
    word-count 298
statistics: vertical or horizontal 300
status bar 441, 449
statusbar 80
stop at punctuation 188
stop at sentence break 188
stop lists 120
stop lists v. match lists 306
stoplist.cod 362
stopping 123
storage 447
suffix frequencies 304
summary statistics (general) 66
suspending processing 123
swap tags and words 368
symbols 420
- T -
T score 289
tag concordancing 198
tag file 141
tag string only tags 143
tag types 145
tag visibility 201
tag-free corpus 368
tagged text 131
tags
    overview 131
tags in WordList 315
tags swapped with words 368
tags to exclude 141
tags to retain 141
teacher instructions 51
TED talks 215
Test for Unicode 46
test text file format 46
text characteristics 124
Text Converter
    .DOC 371
    .DOCX 371
    .Excel 371
    .PDF 371
    multi-word linking 9, 354, 355, 355, 359, 360, 362, 363, 364, 365, 371, 374
</p> <p><span class="badge badge-info text-white mr-2">505</span> Index 490
Text Converter
    asterisk 363
    BNC XML version conversion 374
    conversion file 362
    cutting header 355
    extracting 359
    folders 355
    index 355
    insert numbering 362
    into Unicode 365
    just one change 362
    line-breaks removal 365
    move if 360
    multi-word linking 362
    numbers: insert paragraph numbers in your corpus 362
    overview 9
    purpose 354
    removing all tags 363
    sample conversion file 364
    settings 355
    syntax 363
    Unicode conversion 365
    UTF16 conversion 365
    UTF8 conversion 365
    wildcards 363
text date analysis 126
text file
    use to build a word list 307
text file dates 352
text formats 124
text segments in Concord 218
texts
    choosing 44
    favourites 50
    more texts 44
tie-breaking 208
time-lines 126
title text 143
to right only 295
token
    definition 425
token count 297
token recomputing 310
toolbar 80, 441
tools for pattern-spotting 450
tool-specific limitations
    Concord 437
    KeyWords 437
    Text Converter 437
    Viewer & Aligner 437
    WordList 437
training students 51
Treetagger conversion 369
Troubleshooting 455, 458
    accented symbols 456
    apostrophes not found 455
    colours unreadable 457
    column spacing 455
    Concord tags problem 455
    Concord/WordList mismatch 456
    crashed 456
    curly quotation marks 455
    demo limit 456
    keys don't respond 457
    pineapple-slicing 458
    potato-peeling machine 458
    printer won't print 458
    quotation marks not found 455
    smart quotations 455
    takes ages 458
    the English 458
    weird symbols 456
    won't start 458
    WordList out of order 459
t-score
    formula 433
T-score computing 294
Two word-list analysis 230
type
    definition 425
type and token
    definition 425
type/token ratios 303
type-in mode 428
type-in search 109
types of tag 145
typing characters into Concord 420
- U -
undefined tags 222
underscore 125
underscore tags 368
</p> <p><span class="badge badge-info text-white mr-2">506</span> WordSmith Tools Manual 491
Unicode 419
unicode explained 430
Unicode test 46
Unix to Windows 373
unjoin all entries 271
unjoining entries 271
unmarking 271
unreadable 457
updater.exe 21
updating WordSmith 80
updating your version 21
USB drive 21
USB drive folders 431
user licence 22
user-defined categories 168
    saving 149
user-defined processes 67
UTF8 versus UTF16 430
- V -
valid character
    definition 425
value-added annotation 149
version 4 differences 452
Version Checker
    overview 9
version checking 24
version date 451
version francaise 436
vertical view of statistics 300
video
    obtaining 215
    playing 212
Viewer & Aligner 381, 389
    adjusting with mouse 384
    aligning 381
    aligning -- an example 381
    aligning the sentences 384
    colours 387
    dual-text aligning 381
    editing 385
    index 380
    Korean and English aligned text 381
    languages 385
    moving sentences 384
    overview 11
    paragraph numbering 386
    purpose 379
    reading in your plain text 387
    sentence joining 389
    sentence numbering 386
    settings 390
    splitting 389
    technical aspects 390
    translation mis-matches 391
    troubleshooting 391
    unusual sentences 392
    viewing options 387
viewing original text file 165
viewing the original text 116
- W -
WebGetter
    display 326
    limitations 328
    overview 12
    purpose 324
    settings 325
what is a concordance 159
What's new 4
whole word search 159
why won't it... 455
window management 127
windows
    managing 127
Windows file associations 428
Windows XP 440
word
    definition 425
word cloud settings
    shape 60
word clouds
    collocates 189
    example 128
word count in MS Word 302
Word documents 444
word patterns 207
word separators 427
Word to .txt 371
word_TAG to &lt;TAG&gt;word 368
WordList
    case sensitivity 314
</p> <p><span class="badge badge-info text-white mr-2">507</span> Index 492
WordList
    clusters 278
    coloured tags 315
    comparing word lists 260
    comparison display 261
    compute keywords 259
    create using text file 307
    detailed consistency 263
    displaying comparisons 261
    find files containing a word 269
    finding entries 288
    index 258
    keys for searching 288
    locating entry-types 288
    merging 262
    minimum & maximum settings 314
    n-grams 278
    overview 6
    prefix for tag 315
    purpose 258
    searching using menu 288
    simple consistency 262
    sort order 315
    sorting problems 459
    starting tips 17
    summary statistics 304
    tags 315
    tags as prefix 315
    the basic display 318
WordList Index
    clusters 310
    computing clusters 448
    n-grams 448
    omit # 278
    relationship settings 310
WordList: altering entries 72
Word's results are different 302
WordSmith controller
    Concord settings 222
    index settings 310
    KeyWords settings 254
    WordList settings 311
wordsmith exe file (controller) 4
WordSmith Group 425
WordSmith Tools
    installation 21
    manual 440
    version 451
wordsmith6.ini and networks 23
WSConcGram
    aims 393
    definition of concgram 394
    display 397
    exporting concgrams 404
    filtering 402
    generating concgrams 395
    overview 12
    settings 395
    viewing 397
- X -
X-letter word count 298
XML 435
    attributes 153
    entities 153
    parsing XML 153
    text handling 153
XML simplification 370
- Y -
Yasumasa Someya 274
- Z -
Z score 289
zapping
    filenames recomputed after 129
zip files 453
z-score
    formula 433
Z-score computing 294
</p> <p><span class="badge badge-info text-white mr-2">508</span> WordSmith Tools Manual</p> </div> </div> </div> </div> </body> </html>