Query syntax

Query strings

A valid query string in Coquery consists of one or more query item. Items are separated by spaces. The number of query items in a query string is not limited, but query duration increases with the number of items.

Query items

Query items are the smallest unit in a query string. Each query string has to contain at least one query item. The number of query items is not limited, but the time required to execute a query increases with the number of query items.

Query items are not case-sensitive: Walk, walk, and WALK all specify the same query item. There are four different types of query items. Word items, Transcription items, Lemma items, and Gloss items.

Word items

Word items are used to match tokens in the corpus based on how they are spelled. For example, the Word item walk matches all tokens that are spelled walk. Word items are supported by all corpora.

Transcription items

Transcription items are used to match tokens in the corpus based on their phonetic or phonological transcription. Slashes are used to distinguish Transcription items from Word items. The format of the transcription (e.g. International Phonetic Alphabet, SAMPA transcription symbols) may differ between corpora, and not all corpora provide phonetic or phonological transcriptions.

In order to match the token walk in the CMUdict Pronunciation Dictionary, the Transcription item /W AO1 K/ is used. In the CELEX Lexical Database, the token walk is matched by /'w$k/.

Lemma items

Lemma items are used to match tokens in the corpus that are assigned to the same lemma. Square brackets are used to distinguish Lemma items from Word items. For example, [walk] in CELEX matches the tokens walk, walking, walks, and walked.

Gloss items

Gloss items are used to match tokens for which a gloss or a translation is available, for example in a dictionary corpus. Double quotation marks are used to mark gloss items.

Part-of-speech specifiers

If you want to restrict the matches of a query item to a particular word-class, you can add a part-of-speech (POS) specifier to the query item. POS specifiers are separated from the query item by a dot ., and consist of a POS tag enclosed in square brackets. For example, walk.[nn1] matches all walk tokens in the BNC that are tagged as singular nouns.

The set of available POS tags may differ between different corpora. For example, COCA uses the CLAWS7 tag set, while BNC uses CLAWS5. A list of the POS tags that are available in the active corpus can be displayed by first right-clicking on the part-of-speech column column in the Output column list, and then selecting “View unique values” from the context menu.

For convenience, and in order to retain syntactic compatibility with the BYU syntax (see below), POS specifiers can also be used without providing also a query item. Thus, [nn1] matches all tokens tagged as singular nouns (NN1) in the BNC. Note that this syntax is potentially ambiguous: For example, the tag POS is used in the BNC to tag the possessive marker ‘s. At the same time, there are a few tokens that have POS as their lemma. Similarly, articles in CLAWS7 are tagged by AT, which also exists as a lemma at. In the case of such ambiguities between POS tags and lemmas, Coquery is compatible to BYU-COCA and gives precedence to the part-of-speech tag.

Some corpora (e.g. the CMUdict Pronunciation Dictionary) are not tagged for part-of-speech. With these corpora, POS specifiers cannot be used.

Wildcard characters

In order to allow for partial matching, query items and POS specifiers can contain any number of wildcard characters. There are two types of wildcards characters: question marks ? and asterisks *. They can be used within any query item type (i.e. Word, Lemma, Transcription, Gloss), as well as within the tags of POS specifiers.

Question mark wildcard

A question mark matches exactly one character. This wildcard character can be used to match tokens with a specific length. For example, the query item ? matches all words in the corpus that are spelled using only a single character, for example I, the numericals 1, 2, and so on, but also tokens that consist of a single punctuation mark or a special character such as ., _, or @. Similarly, the query item ??? matches all tokens that consist of three characters, for example the, web, 911, or :-).

If the query item contains other characters in addition to question mark wildcards, the token has to match exactly the other characters, but can have any character in the position of the wildcard. For example, ?alk matches in CELEX the tokens balk, calk, talk, and walk, but not chalk or stalk. It would also not match alk if that token existed in CELEX.

Asterisk wildcard

The asterisk matches any number of characters, including zero and one. A query item * consisting only of the asterisk wildcard will match all tokens in the corpus. This can be useful in longer query strings, or if the query item is restricted to a particular part of speech (see below).

For example, w* matches any token that begins with the character w. w*lk matches any token that begins with the character w and ends with the character sequence lk, e.g. walk, whelk, and waterfolk. w*l*k matches any token that begins with the character w and ends in the character k, and which also contains the character l in any position. This item matches the same tokens as w*lk, but also tokens such as warlock and woolsack.

In combination with POS specifiers, asterisk wildcards are particularly useful to match only tokens that belong to a particular word-class. For example, *.[n*] matches any token with a POS tag that starts with n. In the BNC, this matches the singular noun walk (tagged as NN1) as well as the plural noun walks (tagged as NN2).

Escaping characters

If you want to query any character that has a special meaning in query items (for example the wildcard characters, the square brackets, the quotation mark, or the slashes), you must precede it by the ‘escape’ character \\ (the backslash). For example, if you want to match all occurrences of the asterisk character in a corpus, you have to use the query string \\*, because the unescaped asterisk * is interpreted as a wildcard that matches any word.

Quantified query items

The Coquery syntax allows query strings that match sequences of tokens in the corpus that differ in the number of tokens. This is done by appending to a query item, the range of occurrences that the query should match. The range is enclosed in curly brackets (similar to quantification in regular expressions).

For example, the query string the [n*]{1,3} [v*] any sequence of one, two, or three nouns in the corpus that is preceded by the word the and followed by a verb. In the current absence of a way that allows users to query phrasal constituents in a corpus, this syntax can be used to construct queries that approximate phrasal constructions. Many English noun phrases can be queried by using the query string [DT]{0,1} [jj*|,]{0,6} [n*]{1,3} ~[n*]: this matches any sequence of words that starts either with one or no determiner, followed by a group of up to six tokens that can either be adjectives or commas, followed by a group of one, two, or three nouns, followed by a word that is not a noun.

OR operator

The pipe symbol | acts as an OR operator. The OR operator is available for all query types. The results table will contain the union of the matching tokens. For example, walk|walked|walks matches either walk, walked, walks, but not walking. [walk|talk] in CELEX matches the tokens walk, walking, walks, and walked, as well as the tokens talk, talks, talking, and talked.

Query item lemmatization

Any query item can be prefixed by the hash mark #. This prefix indicates that the matches of this query item will be lemmatized: instead of returning only the exact matching tokens, all other forms that share the corresponding lemma will be returned as well. For example, the query string #wrote will match all tokens of the lemma WRITE, i.e. write, writes, writing, wrote, and written.

Additional examples

/?/ matches all tokens with a transcription that consists of only a single character.

COCA/BYU syntax compatibility

The syntax of the query strings used by Coquery is modelled after the syntax used in the Bigham Young University web corpus interfaces such as COCA or BYU-BNC. A description of the BYU syntax can be found here: http://corpus.byu.edu/coca/help/syntax_e.asp

Most query strings that are valid in the BYU web interfaces are also valid query strings in Coquery. However, where BYU-COCA uses - to negate query items, Coquery uses the hash mark #. Also, Coquery currently does not support synonym matching: [=beat].[v*] matches verbs like hit, strike, or defeat in BYU-COCA. In contrast, Coquery matches this query string to all tokens that are tagged as verbs, and which are written as =beat. Most likely, no token will be matched.

Coquery extends the BYU syntax by allowing for quantified query items. Also, Transcription queries are not supported by the BYU syntax.