Utilities
This chapter deals with the JWPce Utilities.
Introduction
The Utilities are a number of small programs to handle specific tasks related to JWPce. Most people will never need to use these, but some people may be interested in them. All utilities are run from the command line. Currently there are 5 utilities:
Utility   Function
JINDEX    Generates index files for EUC and UTF-8 dictionaries.
KINFO     Generates KANJINFO.DAT from Jim Breen’s KANJIDIC.
RINFO     Manipulates the radical lookup databases.
UINFO     Generates UNICODE conversion tables.
WINFO     Generates the kana->kanji conversion database.
JINDEX -- Dictionary Index Utility
The JINDEX utility generates index files for EUC and UTF-8 dictionaries (mixed mode dictionaries are not currently supported). The format of the command is:
JINDEX dictionary_file [ flags ]
The index is written to the same location as the dictionary_file, but will have the extension .JDX.
The flags parameter consists of any number of the following flags:
Flag         Function
ALLKANA      Includes every kana in the index.
ANYKANA      Includes every kana string of 2 or more characters in the index, regardless of the location of the string (i.e. not just at the beginning of the word).
SHORTKANA    Includes short kana words (1 kana). Note, these must be full words of a single character.
ALLASCII     Includes every ASCII character in the index.
ANYASCII     Includes every ASCII string of 3 or more characters in the index, regardless of the location of the string (i.e. not just at the beginning of the word).
SHORTASCII   Includes short ASCII words (2 or 1 characters). Note, these must be full words of 2 or 1 ASCII characters.
ASCII2       Changes the minimum length of an indexed ASCII string from 3 characters to 2 characters.
SKIPNOTES    Does not generate index entries for characters located in parentheses. This will exclude dictionary ID keys and parenthetical notes.
UTF          Changes the dictionary encoding from EUC to UTF-8.
TEST         Scans the actual file, but does not create the index. This can be used to determine the size of an index file without actually generating the index, which is much faster.
NOWARN       Suppresses warning messages.
1250         Changes the code page from (1252, American/Western European) to Eastern European.
1251         Changes the code page from (1252, American/Western European) to Cyrillic.
1253         Changes the code page from (1252, American/Western European) to Greek.
1254         Changes the code page from (1252, American/Western European) to Turkish.
1255         Changes the code page from (1252, American/Western European) to Hebrew.
1256         Changes the code page from (1252, American/Western European) to Arabic.
1257         Changes the code page from (1252, American/Western European) to Baltic.
1258         Changes the code page from (1252, American/Western European) to Vietnamese.
Dictionary files can be encoded in EUC or UTF-8. The index file for EUC dictionaries does not depend on the code page. The index for UTF-8 dictionaries does (at least at the current time).
By default the code page is 1252 (American/Western European), but if you intend to use the index on some other system you must indicate the code page so the correct UNICODE conversion table can be used.
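The code-page dependence is easiest to see with a single extended-ASCII byte. The following is a minimal sketch, assuming Python's built-in cp125x codecs as a stand-in for JWPce's own conversion tables (which are not reproduced here); the function name decode_extended_ascii is hypothetical.

```python
# Code pages named in this chapter: 1252 is the default, the others are JINDEX flags.
# Python's standard library ships a codec called "cpNNNN" for each of them.
CODE_PAGES = {
    1250: "Eastern European",
    1251: "Cyrillic",
    1252: "American/Western European (default)",
    1253: "Greek",
    1254: "Turkish",
    1255: "Hebrew",
    1256: "Arabic",
    1257: "Baltic",
    1258: "Vietnamese",
}


def decode_extended_ascii(byte_value: int, code_page: int) -> str:
    """Decode one byte under the given Windows code page (illustration only)."""
    if code_page not in CODE_PAGES:
        raise ValueError(f"unsupported code page: {code_page}")
    return bytes([byte_value]).decode(f"cp{code_page}")
```

The same byte 0xE9 decodes to a Latin letter under 1252, a Cyrillic letter under 1251, and a Greek letter under 1253, which is why an index built with the wrong code page would point at the wrong characters.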
Indexing every character in the dictionary would generate an exceptionally large index file. To reduce the size of the index file, some limitations are normally placed on which sequences are indexed. The following table shows the default index conditions:
Kanji     Every kanji in the file will be indexed.
Symbols   Almost every symbol in the file will be indexed. There are not that many of these, so this does not increase the size of the index file much.
Kana      Kana sequences of 2 or more kana occurring at the beginning of a word are indexed.
ASCII     ASCII sequences of 3 or more characters occurring at the beginning of a word are indexed.
Numbers   Numerical sequences occurring at the beginning of a word are indexed. The number of these is small, so the size increase in the index is small.
Many of these indexing conditions can be changed using the flags. All of the indexing flags, except for SKIPNOTES, will increase the size of the index file.
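The default conditions can be sketched in a few lines. This is only an illustration working on Unicode strings, not the JINDEX implementation: the character ranges, the idea that a "word" starts wherever the character class changes, and the names char_class and default_index_offsets are all assumptions. Symbols are left out for brevity.

```python
from typing import List


def char_class(ch: str) -> str:
    """Rough character classes; the ranges are assumptions for this sketch."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3041" <= ch <= "\u30ff":
        return "kana"
    if ch.isascii() and ch.isalpha():
        return "ascii"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def default_index_offsets(text: str) -> List[int]:
    """Return the character offsets that the default conditions would index."""
    offsets = []
    i = 0
    while i < len(text):
        cls = char_class(text[i])
        j = i
        # Find the end of the run of characters sharing this class.
        while j < len(text) and char_class(text[j]) == cls:
            j += 1
        run = j - i
        if cls == "kanji":
            offsets.extend(range(i, j))      # every kanji
        elif cls == "kana" and run >= 2:
            offsets.append(i)                # 2+ kana at the start of a word
        elif cls == "ascii" and run >= 3:
            offsets.append(i)                # 3+ ASCII at the start of a word
        elif cls == "digit":
            offsets.append(i)                # numbers at the start of a word
        i = j
    return offsets
```

For the text "漢字 かな a abc 12" the sketch indexes both kanji, the start of かな, the start of abc, and the start of 12, but skips the lone "a".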
It is important to understand the ALL, ANY, and SHORT flags. The easiest way to see what these do is to consider how they index some kana words. [Example kana words and the table of results for each flag are not reproduced here.]
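One way to picture the kana flags is the sketch below. It is an interpretation of the flag descriptions in this chapter, not code taken from JINDEX: in particular, ANYKANA is assumed to index every kana that is followed by at least one more kana, and a kana "word" is assumed to be an unbroken run of kana. The function name kana_index_offsets is hypothetical.

```python
from typing import Iterable, List


def is_kana(ch: str) -> bool:
    # Hiragana and katakana blocks; the range is an assumption for this sketch.
    return "\u3041" <= ch <= "\u30ff"


def kana_index_offsets(text: str, flags: Iterable[str] = ()) -> List[int]:
    """Offsets of kana that would be indexed under the given kana flags."""
    flags = set(flags)
    offsets = set()
    i = 0
    while i < len(text):
        if not is_kana(text[i]):
            i += 1
            continue
        j = i
        while j < len(text) and is_kana(text[j]):
            j += 1
        run = j - i
        if run >= 2:
            offsets.add(i)                    # default: start of a 2+ kana word
        if "SHORTKANA" in flags and run == 1:
            offsets.add(i)                    # full word of a single kana
        if "ANYKANA" in flags:
            offsets.update(range(i, j - 1))   # any position starting 2+ kana
        if "ALLKANA" in flags:
            offsets.update(range(i, j))       # every kana
        i = j
    return sorted(offsets)
```

With the text "かたな の", the default indexes only offset 0; SHORTKANA adds the single-kana word at offset 4; ANYKANA adds offset 1; ALLKANA indexes every kana.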
WARNING! This utility must sort the index into order. For a large index, this can take some time.
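When JINDEX is run from a script, it can help to check the flags before launching it. The sketch below builds the argument list and guesses where the index will be written. The helper names are hypothetical, and two behaviors are assumptions rather than documented facts: that flags may be upper-cased safely, and that .JDX replaces any existing extension on the dictionary file name.

```python
from pathlib import PureWindowsPath
from typing import Iterable, List

# Flags listed in the JINDEX flag table of this chapter.
JINDEX_FLAGS = {
    "ALLKANA", "ANYKANA", "SHORTKANA",
    "ALLASCII", "ANYASCII", "SHORTASCII", "ASCII2",
    "SKIPNOTES", "UTF", "TEST", "NOWARN",
    "1250", "1251", "1253", "1254", "1255", "1256", "1257", "1258",
}


def jindex_command(dictionary_file: str, flags: Iterable[str] = ()) -> List[str]:
    """Build the argument list: JINDEX dictionary_file [ flags ]."""
    normalized = [flag.upper() for flag in flags]
    unknown = [flag for flag in normalized if flag not in JINDEX_FLAGS]
    if unknown:
        raise ValueError(f"unknown JINDEX flags: {unknown}")
    return ["JINDEX", dictionary_file, *normalized]


def expected_index_path(dictionary_file: str) -> str:
    """Same location as the dictionary, with the extension .JDX (assumed to replace any suffix)."""
    return str(PureWindowsPath(dictionary_file).with_suffix(".JDX"))
```

The resulting list can be passed to subprocess.run, assuming the JINDEX executable is on the search path.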
KINFO -- Character Information Utility
It is not convenient for JWPce to use Jim Breen’s KANJIDIC file directly. KANJIDIC is a basic text file that is relatively large and difficult to search without loading all of the information into memory. Instead, JWPce uses the KANJINFO.DAT file, which contains the same information in a more compact format and can be searched quickly.
This utility converts Jim Breen’s KANJIDIC into a binary format used by JWPce (KANJINFO.DAT). The format of this command is:
KINFO [EUC] [UTF8] [STATS] [IN=filename]
If the file name is not specified, KINFO will assume KANJIDIC. This utility normally assumes the dictionary is in EUC, but will also support UTF-8. The STATS flag causes the utility to output information about the ranges and number of kanji including different indexes. I use this information to make modifications to KANJINFO.DAT.
This utility will write a number of files:
KANJINFO.DAT     Large form of KANJINFO.DAT. This file contains all the information in KANJIDIC.
KANJINFO.MED     Medium form of KANJINFO.DAT. This file does not contain nanori, pinyin, or Korean entries.
KANJINFO.SML     Small form of KANJINFO.DAT. Reduced file that contains only the fixed size data (bushu, strokes, grade, skip, Halpern, nelson, and Haig), meanings, on-yomi, and kun-yomi.
JWP_UNIC.DAT     Contains UNICODE information for the kanji. This file was never used. JWPce actually uses the UNICODE conversion tables from the UNICODE Consortium (see UINFO below).
KANJISRK.DAT     Contains stroke information for the kanji. This file is used by the RINFO utility to generate radical lookup data.
KANJI_FREQ.EUC   Obsolete file, no longer generated. Contained the kanji by frequency index using Jack Halpern’s frequency data listed in KANJIDIC.
RINFO -- Radical Lookup Database Utility
This utility processes files used for the radical lookup feature. The utility takes no parameters, but reads a number of files:
kanjisrk.dat   Kanji stroke data extracted from Jim Breen’s KANJIDIC.
radkanji.idx   Index file for radical data. This data was first compiled by Michael Raine and Derc Yamaski.
radkanji.dat   Radical data file compiled by Michael Raine and Derc Yamaski.
The files stroknji.idx and stroknji.dat can also be read, but these stroke files compiled by Michael Raine and Derc Yamaski are no longer used.
The utility will write the following files:
stroke.euc    EUC file containing the kanji by stroke count.
radical.euc   EUC file containing the kanji by radical.
stroke.dat    Stroke count database used by JWPce for radical lookup.
radical.dat   Radical database used by JWPce for radical lookup.
UINFO -- Unicode Conversion Utility
This utility generates the UNICODE conversion tables used by JWPce. These tables are stored as C code that is actually compiled into JWPce. This utility takes no parameters and reads the file JIS0208.TXT. This file is produced by the UNICODE Consortium. The utility writes the following files:
jwp_ukan.dat     Conversion table for JIS kanji.
jwp_umis.dat     Conversion table for symbols.
jwp_cp1250.dat   Conversion table for Eastern Europe extended ASCII.
jwp_cp1251.dat   Conversion table for Cyrillic extended ASCII.
jwp_cp1252.dat   Conversion table for USA, West Europe extended ASCII.
jwp_cp1253.dat   Conversion table for Greek extended ASCII.
jwp_cp1254.dat   Conversion table for Turkish extended ASCII.
jwp_cp1255.dat   Conversion table for Hebrew extended ASCII.
jwp_cp1256.dat   Conversion table for Arabic extended ASCII.
jwp_cp1257.dat   Conversion table for Baltic extended ASCII.
jwp_cp1258.dat   Conversion table for Vietnamese extended ASCII.
WINFO -- Kana->Kanji Conversion Utility
This utility builds the kana->kanji conversion database used by JWPce. A number of different sources can go into the construction of this table. The syntax for calling the utility is:
WINFO filename [ alloc ]
The alloc parameter determines the maximum number of conversions allocated. This must be more than the number of conversions you expect, because there are usually duplicates that have to be removed. By default this parameter is set at 500,000.
If you compile this utility make sure the stack space is set quite high. The utility uses a quicksort algorithm to order the list. This can use a substantial amount of stack space. MS VC++ allocates a 1 MB stack by default. This is not enough to run the standard configuration. I normally allocate 20 MB, just to be safe. If the utility runs out of stack space you will get a system crash!
The filename parameter must specify a configuration file to read. The standard configuration file is called STANDARD.EUC.
The utility will write the files WNN.DAT and WNN.DIX. These are the kana->kanji conversion database and its index file. The utility can also write the older format conversion database that was used by JWP, but this has been disabled since those files are no longer used. For debugging purposes a number of other files will be written:
test1.euc   Raw data read from all sources.
test2.euc   Sorted data read from all sources.
test3.euc   Filtered data read from all sources. Duplicates and unwanted entries are removed.
test4.euc   Merged final data.
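The four debugging files suggest a raw, sorted, filtered, merged pipeline. The sketch below mirrors those stages under several assumptions that are not taken from the WINFO source: a conversion is modeled as a (kana, kanji, value) tuple, a duplicate is a repeated (kana, kanji) pair, "merged" means grouping kanji under their kana with higher values first, and only duplicate removal (not the removal of unwanted entries) is shown. Python's built-in sort stands in for WINFO's quicksort.

```python
from typing import Dict, List, Tuple

Conversion = Tuple[str, str, int]  # (kana, kanji, value); a modeling assumption


def build_conversions(raw: List[Conversion], alloc: int = 500_000) -> Dict[str, List[str]]:
    """Sort, filter, and merge raw conversions into kana -> [kanji, ...]."""
    # alloc must cover the raw count, duplicates included, as the text notes.
    if len(raw) > alloc:
        raise ValueError(f"{len(raw)} conversions exceed alloc={alloc}")

    # Stage 2 (test2.euc): sorted data. Higher values come first within a kana.
    ordered = sorted(raw, key=lambda c: (c[0], -c[2], c[1]))

    # Stage 3 (test3.euc): filtered data with duplicates removed.
    seen = set()
    filtered = []
    for kana, kanji, value in ordered:
        if (kana, kanji) in seen:
            continue
        seen.add((kana, kanji))
        filtered.append((kana, kanji, value))

    # Stage 4 (test4.euc): merged data, one candidate list per kana reading.
    merged: Dict[str, List[str]] = {}
    for kana, kanji, _value in filtered:
        merged.setdefault(kana, []).append(kanji)
    return merged
```

Because the size check happens before duplicates are dropped, the sketch also shows why alloc has to be larger than the number of conversions in the final database.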
Configuration File
The configuration file is an EUC file containing a number of different commands. Each line should contain a single command. Blank lines are allowed, and any line beginning with a # is treated as a comment line. The following commands are supported:
DIC     Extracts kana->kanji conversions from a dictionary in EDICT format. EUC, UTF-8, and Mixed dictionaries are supported.
END     This command ends the file. This command must be in the file.
WNN     Extracts data from a WNN file. These files are normally produced by the WNN consortium. Older versions of these files (as are distributed with JWPce) are freely distributed. Newer versions are not.
WLINE   Contains a single kana->kanji conversion in the WNN format, but entered on a single line. I used to use these to make additions to the conversion database, but I have moved all of them into ROSENTHAL.U.
DIC Entry
This entry specifies a dictionary in EDICT format. All or some valid kana->kanji conversions will be extracted from the file. Conversions with priority markings are assigned value 1. Conversions without priority markings are given value 0.
The format of the line is:
DIC ( ALL | PRIORITY ) filename
The ALL option extracts all entries from the dictionary. The PRIORITY option extracts only priority entries. Such entries must end with a /(P)/.
Entries that do not contain kanji will automatically be skipped, as will certain entries that mix character formats.
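The configuration rules described so far (one command per line, blank lines and # comments ignored, a mandatory END, and the DIC line format) can be sketched as a small parser. This is an illustration, not WINFO's own parser: the function names are hypothetical, whitespace-separated fields and upper-case command names are assumptions, and the WLINE arguments are kept as raw text because their layout is not spelled out here.

```python
from typing import Iterable, List, Tuple


def parse_winfo_config(lines: Iterable[str]) -> List[Tuple[str, ...]]:
    """Parse configuration lines into command tuples, stopping at END."""
    commands: List[Tuple[str, ...]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue                          # blank line or comment line
        parts = line.split(None, 1)
        keyword = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "END":
            return commands                   # END ends the file
        if keyword == "DIC":
            fields = rest.split(None, 1)
            if len(fields) != 2 or fields[0] not in ("ALL", "PRIORITY"):
                raise ValueError(f"line {number}: expected DIC ( ALL | PRIORITY ) filename")
            commands.append(("DIC", fields[0], fields[1]))
        elif keyword == "WNN":
            if not rest:
                raise ValueError(f"line {number}: expected WNN filename")
            commands.append(("WNN", rest))
        elif keyword == "WLINE":
            commands.append(("WLINE", rest))  # kept raw; format not modeled here
        else:
            raise ValueError(f"line {number}: unknown command {keyword!r}")
    raise ValueError("configuration file has no END command")


def dic_entry_value(edict_line: str) -> int:
    """Value 1 for an entry carrying the /(P)/ priority marking, otherwise 0."""
    return 1 if edict_line.rstrip().endswith("/(P)/") else 0
```

A real configuration file is EUC encoded, so a caller would typically open it with encoding="euc_jp" before passing the lines in.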
END Entry
Each configuration file must terminate in an END command. The format of the command is:
END
WNN Entry
Extracts the kana->kanji conversions from a WNN formatted file. The format of the command is:
WNN filename
Most of these data files were compiled by the WNN consortium and are under copyright of Kyoto University Research Institute for Mathematical Sciences, although I have also created some.
The basic format of these files is as follows:
\total_blank_ entries
Each of the entries has the form:
kana kanji part_of_speech value
You can examine the files or check the WINFO code to determine the details. The basics of each field are:
kana             Kana for the conversion. Verb and adjective endings are not included.
kanji            Kanji for the conversion. Verb and adjective endings are not included.
part_of_speech   Indicates the part of speech. This is used to determine verb endings. The important parts of speech are:
value            Indicates the priority of the conversion. Higher priorities are listed earlier in the list.
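The four-field entry can be read with a few lines of code. The sketch below assumes the fields are separated by whitespace and that value is an integer; neither detail is confirmed here, and the real files should be checked against the WINFO code. The names WnnEntry, parse_wnn_entry, and order_candidates are hypothetical, and any sample lines fed to it are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class WnnEntry(NamedTuple):
    kana: str
    kanji: str
    part_of_speech: str
    value: int


def parse_wnn_entry(line: str) -> WnnEntry:
    """Split one entry of the form: kana kanji part_of_speech value."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    kana, kanji, part_of_speech, value = fields
    return WnnEntry(kana, kanji, part_of_speech, int(value))


def order_candidates(entries: Iterable[WnnEntry]) -> List[WnnEntry]:
    """Higher priorities are listed earlier; ties keep their input order."""
    return sorted(entries, key=lambda entry: -entry.value)
```

Sorting on the negated value keeps the sort stable, so entries with equal priority stay in the order in which they were read.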