This interface connects you to a morphological server that performs morphological analysis of Czech word forms. The results are presented in a tabular, color-coded form in which everything can be converted to a more detailed, human-readable description with a single click.
The main features of the interface and of the morphological server are:
See below for more detailed information and help. Or go right to the Morphological Analyzer Word/Text entry page now and try it!
To use all the features fully, you need a JavaScript-enabled browser. Netscape 4.03 and later appear to work, as do Internet Explorer 5.0 and up. If you do not have a graphical user interface at all, you may use Lynx, which provides limited functionality but still works. In some cases, using a completely JavaScript-unaware browser might work better than a browser with broken JavaScript (such as Netscape 3.01). We hope to solve the problems soon so that you will be able to use those unfortunate pieces of software, in case upgrading your browser is not an option.
You enter the word forms interactively in the little input window ("Word or short text to analyze"), and the results appear in a new window. There is always one Results, Help, Description etc. window open for any given browser instance. These windows appear and/or are refreshed as necessary; do not be afraid to close them, they will reappear with the next results. You may resize them so that you can watch the results change as you enter a new input word or text, minimize them etc. as usual.
You may use various coding schemes, depending on the capabilities of your system and/or browser. See below under Input (Entry) Code and Output (Display) Code for detailed instructions. If you are familiar with those, you might want to consult the Code Tables directly.
The so-called Contents Filters discard some of the results of the analysis to lower the ambiguity rate in an environment where those results do not make sense. For example, you can filter out all colloquial Czech analyses if you know that the input is controlled or simply standard Czech, such as the text of a respectable Czech newspaper (Mladá Fronta DNES, Lidové Noviny etc.).
Click anywhere in the input box. When the cursor starts blinking, you may enter text. Use spaces between words as usual; there is no need to put a space in front of or between punctuation marks (the input text is tokenized internally in the morphological analyzer). Numbers and other weird things are allowed. For entering accented characters, see below (for quick reference, see the Latin 2 Code Tables). The maximum number of characters accepted is 250 (over four continuous lines of text; this limit helps to make sure you do not have to wait too long -- the surrounding HTML markup for a 40-word sentence inflates the page to an average of 30k bytes or sometimes even much more). For those of you who do not know any Czech and just want to crash the thing, try words like "a", "by", "je", "KDU", "V", "administrativou", and if you are really into an adventure, try to cut and paste "nejštědřejšíma".
When finished, please read the Copyright Notice at the bottom of the page, and then press the Submit button. Once you know what the Copyright Notice says, you can also simply press the "Return" key while the text input box is active (the cursor is blinking).
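The entry rules above (a 250-character limit, and no need for spaces around punctuation) can be illustrated with a small sketch. The analyzer's real tokenizer is internal and not documented here, so the regular expression and the function name check_and_tokenize below are assumptions made only for illustration.

```python
import re

MAX_CHARS = 250  # limit stated in this help text


def check_and_tokenize(text):
    """Reject over-long input, then split it into words and punctuation marks.

    The splitting rule is a guess, not the analyzer's actual tokenizer.
    """
    if len(text) > MAX_CHARS:
        raise ValueError(f"input has {len(text)} characters, limit is {MAX_CHARS}")
    # \w+ matches a run of letters/digits (including accented letters);
    # [^\w\s] matches one punctuation mark at a time.
    return re.findall(r"\w+|[^\w\s]", text)


print(check_and_tokenize("Byt je,prý,velký."))
```

Under this assumed rule, punctuation is separated from the neighbouring words whether or not a space was typed.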
Input (Entry) Code
There are still problems entering Czech accented characters in some browsers; they rely too much on the operating system's ability to enter the correct accented letters. Unfortunately, even if your system can enter them, the browser will not tell which code the characters have been entered in. Thus the analyzer must be provided with reliable side information about the input code. In some cases the browser will do the right translation of the accented characters (since, of course, it has been instructed to do so) without you even having to select correctly between ISO Latin 2 and Windows CP 1250 (see below; we try to set the correct conversion automatically depending on your browser). So far, however, only MS Internet Explorer version 4.0 and Netscape 4.03 and later are capable of doing so. The Czech localized version of MSIE 5.00 and plain US English Netscape 4.03 and 4.74 have been proven to work; some other versions of MSIE (4.0, 5.5) have also been reported to work. Your version might work, too. Just try it.
If you are on a Unix system which has been properly localized into Czech, use the default (ISO Latin 2, or iso-8859-2) setting. Even if you cannot see the characters properly when typing in the browser window (which means that your Czech keyboard works, but there is a bad system font setting, and/or broken Netscape... oh well), don't worry; they will be displayed correctly (with high probability :-)) on the output screen (or if not, set one of the alternative output settings.)
If you have Windows 95, 98 or 2000 Professional, your system is most likely set for the Windows CP 1250 codepage (or at least you can invoke the proper "keyboard" by pressing Ctrl-Shift, Alt-Shift or something similar to employ the Czech keyboard -- watch the lower right corner of the screen for a blue "Cz"). If you have not installed the Czech keyboard yet, it is easy to do so (you need the Windows CD, though, unless it is already on your hard disk, which is often the case): in the Control Panel, click on Keyboard, then Add, and find Czech. Check the screen for any additional settings, reboot and voila. (Don't forget to switch to it by pressing left Ctrl-Shift or whatever you designated during installation as your CZ Keyboard switch combination.) The main web page uses JavaScript to determine your browser and operating system, and tries to set the default encoding for you, but please check it before you send a bug report. The usual confusion concerns the characters s, z and t, all accented with the "hacheck" accent (š, ž, ť), or their Windows CP 1250 / ISO Latin 2 codemates.
If you can use neither ISO Latin 2 nor Windows CP 1250, the remaining input modes allow for alternative entry methods, including easy cut & paste from various applications and editors. If you write in (La)TeX (or "Scientific Word", and can see the TeXized source), you might be familiar with the usual backslash notation: \'{a} or \'a stands for the letter a with the "long" accent (similarly for other letters), \v c or \v{c} for the letter c with the "hacheck" accent. For u with ring, use {\accent'27u} (similarly for capital U).
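To make the backslash notation concrete, here is a minimal sketch that turns the forms listed above into accented letters. It is not the analyzer's own converter; the function name tex_to_unicode and the choice of which letters to cover are assumptions for illustration only.

```python
import re

# Czech letters only; anything else is left untouched (an assumption of this sketch).
ACUTE = dict(zip("aeiouyAEIOUY", "áéíóúýÁÉÍÓÚÝ"))
CARON = dict(zip("cdenrstzCDENRSTZ", "čďěňřšťžČĎĚŇŘŠŤŽ"))
RING = {"u": "ů", "U": "Ů"}


def _lookup(table):
    # Build a re.sub callback: the letter is in group 1 (braced) or group 2 (bare).
    def repl(match):
        letter = match.group(1) or match.group(2)
        return table.get(letter, match.group(0))
    return repl


def tex_to_unicode(text):
    # {\accent'27u} -> ů, the ring notation quoted in the help text
    text = re.sub(r"\{\\accent'27\s*([uU])\}", lambda m: RING[m.group(1)], text)
    # \'{a} or \'a -> á
    text = re.sub(r"\\'(?:\{([A-Za-z])\}|([A-Za-z]))", _lookup(ACUTE), text)
    # \v{c} or \v c -> č
    text = re.sub(r"\\v(?:\{([A-Za-z])\}|\s+([A-Za-z]))", _lookup(CARON), text)
    return text


print(tex_to_unicode(r"\v c\'{a}rka, k{\accent'27u}\v{n}, b\'yt"))
```

Letters that take no such accent in Czech are passed through unchanged, so a stray macro stays visible instead of being silently dropped.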
If you prefer SGML, then SGML entities are the solution. For every accented letter, an SGML entity name exists. Entity names are enclosed between an ampersand (&) and a semicolon (;). For example, A with (long) accent (Á) is &Aacute;. Lowercase r with "hacheck" accent (ř) is coded with the 'caron' suffix: &rcaron;. U with ring (Ů) uses &Uring;. These names are not handled correctly by browsers (in fact, even the newest HTML 4.01 specification does not contain the ISO Latin 2 entity names as a recommendation; no wonder browsers do not understand them), but the analyzer interface does its own decoding.
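As an illustration of the entity notation, the following sketch replaces a handful of entity names with the corresponding letters. The table is a small hand-picked subset and the function name decode_entities is hypothetical; the interface's actual table is not shown in this help text.

```python
import re

# Illustrative subset of ISO entity names for Czech letters (not the full set).
ENTITIES = {
    "Aacute": "Á", "aacute": "á",
    "Rcaron": "Ř", "rcaron": "ř",
    "ccaron": "č", "scaron": "š", "zcaron": "ž",
    "Uring": "Ů", "uring": "ů",
}


def decode_entities(text):
    """Replace &name; with its letter; unknown names are left as they are."""
    return re.sub(
        r"&([A-Za-z]+);",
        lambda m: ENTITIES.get(m.group(1), m.group(0)),
        text,
    )


print(decode_entities("&rcaron;e&ccaron;"))
```

Leaving unknown names untouched is a design choice of this sketch: it keeps a mistyped entity visible in the output rather than deleting it.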
For quicker input, a Pseudocode has been implemented. It uses a single special character for each accent, written after the character which is being accented. A single apostrophe (') is used for the long accent, and double quotes (") for the "hacheck" accent and for the ring, too.
Finally, if you select Unaccented, a different dictionary will be employed altogether to give you all possible analyses regardless of accents. For example, if you enter 'byt' in this mode, analyses will be provided for all of 'byt' (lit. apartment), 'být' (lit. to be), and 'byť' (lit. although).
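The Pseudocode rule can be sketched as follows. Only the two accent characters come from the description above; the letter tables, the function name decode_pseudocode, and the decision that a double quote after u means the ring (the only Czech letter with a ring) are assumptions of this sketch.

```python
# Long accent: apostrophe after the letter.
ACUTE = dict(zip("aeiouyAEIOUY", "áéíóúýÁÉÍÓÚÝ"))
# Double quote after the letter: hacek, or ring when the letter is u/U.
HACEK_OR_RING = dict(zip("cdenrstzuCDENRSTZU", "čďěňřšťžůČĎĚŇŘŠŤŽŮ"))


def decode_pseudocode(text):
    out = []
    for ch in text:
        if ch == "'" and out and out[-1] in ACUTE:
            out[-1] = ACUTE[out[-1]]          # accent the previous letter
        elif ch == '"' and out and out[-1] in HACEK_OR_RING:
            out[-1] = HACEK_OR_RING[out[-1]]  # hacek or ring on the previous letter
        else:
            out.append(ch)                    # ordinary character, copy as is
    return "".join(out)


print(decode_pseudocode("nejs\"te\"dr\"ejs\"i'ma"))
```

An accent character that does not follow a suitable letter is copied through unchanged, so ordinary quotation marks in the input survive.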
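One way to picture the Unaccented mode is a lookup keyed by the accent-stripped form of each word. The sketch below uses Unicode decomposition to strip accents; this is an assumed mechanism for illustration, since the text only says that a different dictionary is used, and the names strip_accents and index are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Decompose each letter and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical miniature index: unaccented key -> accented candidates.
index = defaultdict(list)
for form in ["byt", "být", "byť"]:
    index[strip_accents(form)].append(form)

print(index["byt"])
```

With such an index, the single unaccented entry 'byt' leads to all three candidate forms from the example above.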
A graphical code table is provided showing the ISO Latin 2 and Windows CP 1250 codes and characters and all alternative means of character entry.
Sometimes it is not desirable to output really all possible analyses of a given form, since it only raises the amount of homonymy in cases which are thought to be clearly distinguishable. For example, if you analyze current news stories from a reputable newspaper, it is highly unlikely that a colloquial form is used anywhere in the text, especially if it can be confused with a standard form.
Five settings are provided, which in turn set internal flags for suppressing certain combinations of results based on both morphological information (tags) and lexical information (if present in the main dictionary).
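The idea of suppressing analyses by flag can be sketched as below. Everything in this block is hypothetical: the Analysis record, its colloquial field, the placeholder lemma and tag strings, and the function name filter_colloquial are invented for illustration and do not reflect the server's internal flags or tag set.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    lemma: str        # placeholder lemma
    tag: str          # placeholder tag string, not a real tag
    colloquial: bool  # hypothetical flag marking a colloquial reading


def filter_colloquial(analyses, suppress=True):
    """Drop analyses flagged as colloquial when the filter is switched on."""
    if not suppress:
        return list(analyses)
    return [a for a in analyses if not a.colloquial]


results = [
    Analysis("lemma-1", "TAG-STANDARD", colloquial=False),
    Analysis("lemma-2", "TAG-COLLOQUIAL", colloquial=True),
]
print(len(filter_colloquial(results)), len(filter_colloquial(results, suppress=False)))
```

With the filter on, only the standard reading remains, which is the reduction in homonymy described above.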
You will rarely want to use Unaccented on output, however, since any of the substitutional methods will give you full and clearly visible information, even if your fonts are broken, too. Moreover, if nothing pleases you enough, you can select two additional coding possibilities on output: fully graphical output of accented letters, which is understood by any browser that can display images in the GIF format (Graphics output, or for better visibility Graphics output (bold); do not forget to switch on "Images" if you have them off, so that they load continuously when using these code options). Also please keep in mind that ISO Latin 2 and Windows CP 1250 are pretty close for Czech characters, except for the following:
š, ť and ž.
A graphical code table is provided showing the ISO Latin 2 and Windows CP 1250 codes and characters as well as all alternative means of displaying accented characters.
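The difference between the two codes can also be checked mechanically. The sketch below uses Python's standard iso8859_2 and cp1250 codecs to list the Czech accented letters whose byte codes differ; the letter list and the function name differing_letters are this sketch's own, not part of the interface.

```python
# Czech accented letters, lowercase then uppercase.
CZECH = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"


def differing_letters(letters=CZECH):
    """Return (letter, ISO Latin 2 hex code, CP 1250 hex code) where the codes differ."""
    rows = []
    for ch in letters:
        latin2 = ch.encode("iso8859_2")
        cp1250 = ch.encode("cp1250")
        if latin2 != cp1250:
            rows.append((ch, latin2.hex(), cp1250.hex()))
    return rows


for letter, latin2_code, cp1250_code in differing_letters():
    print(letter, "ISO Latin 2:", latin2_code, "CP 1250:", cp1250_code)
```

Only š, ť, ž and their capitals are reported; every other Czech accented letter has the same code in both tables, which is why a wrong setting usually shows up on exactly these letters.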
1 The 100% compatibility does not apply to the "Tagger" mode (when the "Run Tagger" box is checked), since the tagger must use a slightly different dictionary (namely, the one which has been used to train the tagger). However, the tag and lemma description works the same way in both cases, and the descriptions do not differ.
Please send comments, suggestions and bug reports to hajic@cs.jhu.edu.