Ripping online dictionaries with Node.js, part 1: static pages; CLI; DSL → TXT, PDF, DjVu; associated tasks
ABBYY has created good software for working with dictionaries, but no less a contribution to digital lexicography is a byproduct of ABBYY Lingvo's development: the dictionary markup language DSL. It has long since outgrown Lingvo and become a de facto standard format for other dictionary shells, including one of the most famous of its kind, GoldenDict.
But ABBYY alone would not have achieved such success without the help of a large army of enthusiast lexicographers who, year after year, have been doggedly digitizing paper dictionaries and converting digital ones, from miniature specialized dictionaries to huge general-purpose ones.
One of the most famous and fruitful groups has long been working on the site forum.ru-board.com. Over time it has accumulated a vast collection of dictionaries, a solid knowledge base, and a set of tools to support their creators and editors. Many scripts and programs have been written there, reflecting the history and the changing popularity of the programming languages more or less suited to text processing: Perl and Python, shell batch languages, MS Word and Excel macros, and compiled programs in general-purpose languages.
Until recently, however, one language was hardly represented in this area. I would like to fill that gap and pay tribute to the rapid growth of the power, functionality, and popularity of JavaScript. I think it can be a great help to today's programming lexicographers, especially on the border between network and local lexicography.
Creating a local copy of an online dictionary usually takes several steps: saving the HTML pages with a program like Teleport, cleaning them of tags with regular expressions (in text editors, by macros, or with scripts), and finally marking the text up in DSL. JavaScript in its Node.js incarnation can significantly shorten and ease this path, because the language is native to the web and can operate on network data without sinking to the precarious and changeable level of raw markup and regular expressions, working instead at the level of the DOM.
I will try to illustrate the capabilities of the language and some of its libraries with the example of creating a local copy of one of the richest and most popular English explanatory dictionaries born on the net: Urban Dictionary. The fruits of these efforts can be judged from the following distributions on popular trackers:
rutracker.org/forum/viewtopic.php?t=5106848
nnm-club.me/forum/viewtopic.php?t=951668
kinozal.tv/details.php?id=1389116
If you are not planning to rip any online dictionary, you can start reading from the third part of the article: it contains examples of other common tasks in working with electronic dictionaries that can be solved with the help of Node.js.
It should be noted, however, that programming is just a hobby for me. This is both a warning that the examples below are not professional and an encouragement for those who, like me, have only a liberal-arts education.
It is assumed that the reader knows JavaScript in its pure and applied variants and understands the basics of Node.js. If not, you will have to start from scratch or fill in the blanks: JavaScript, the DOM, and Node.js.
In this article we will limit ourselves to processing static pages (meaning pages whose key content does not change when JavaScript is disabled) with console scripts. In the following parts we will look at saving dynamic sites (whose key content is generated by scripts) and at programs with a GUI.
Since we will run the scripts only on our own computers and will use new language features, the latest version of Node.js is recommended.
I. Preliminary stage: fetching the list of entry addresses
At least three algorithms for creating a local copy of an online dictionary are possible.
1. In the worst case the dictionary provides no reliable mechanism for iterating over all its articles. Then you have to analyze the URL pattern and substitute into it, one by one, the words of some more or less complete list of the language's lexemes (the word list can be borrowed from the largest digitized dictionary available), discarding failed requests.
2. Some dictionaries let you walk a chain from the first headword to the last (via a "next word" link or a set of links to the following headwords). This is the simplest way, but not the most transparent: it is difficult to estimate the total number of words in advance and then to monitor the copying progress. So although Urban Dictionary provides this functionality (each word's page has a column of links to the nearest preceding and following articles), we will use the third way.
3. If the dictionary has a separate list of links to all its entries, we copy that whole set of links to a file. This gives us an idea of the upcoming volume of requests and the ability to monitor the percentage done. In Urban Dictionary, for example, from URLs like www.urbandictionary.com/browse.php?character=A, www.urbandictionary.com/browse.php?character=A&page=2, etc. you can get the list of addresses of all articles for words beginning with that letter (such as www.urbandictionary.com/define.php?term=a, www.urbandictionary.com/define.php?term=a%5E_%5E, etc.).
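The URL template of the third way can be sketched as a small helper. The helper name and the use of URLSearchParams are my own; the query scheme follows the browse.php links quoted above:

```javascript
// Letters to iterate over; '*' covers headwords starting with special characters.
const LETTERS = [...'ABCDEFGHIJKLMNOPQRSTUVWXYZ', '*'];

// Build the address of one page of the per-letter link list.
function browseUrl(letter, page = 1) {
  const params = new URLSearchParams({ character: letter });
  if (page > 1) params.set('page', String(page));
  return 'https://www.urbandictionary.com/browse.php?' + params.toString();
}

console.log(browseUrl('A'));    // the first list page for the letter A
console.log(browseUrl('A', 2)); // the second page
```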
So the whole process of saving the dictionary is divided into two phases, each handled by a separate script.
Here is the code of the first script, which saves the list of links to the dictionary articles:
UD.get_toc.js
1. In the initial part of the script we load the required library modules (or, if particular methods will be called often, those methods themselves). Almost all the modules needed come with the Node.js installation. The only external one is jsdom: Node.js by itself cannot parse HTML pages and turn them into a DOM tree, and this module supplies that ability (installing modules is simple, since Node.js ships with the npm package manager: just open a console, go to the folder with the script, and type npm install jsdom).
After loading the modules, the script determines the folder where the files will be saved (if the user did not specify a folder in the first command-line argument when running the script, the folder containing the script is used) and creates three future documents: the list of entry addresses; the list of processed pages from which those addresses were taken; and a report on any errors that occurred.
At the end of the first part, four service variables are created to store:
— an array of the English alphabet (for adding letters one by one when building the URLs of the link lists; the symbol *, responsible for the list of links to headwords beginning with special characters, is appended at the end of the array);
— the previous and the current request URL (so that, on an error, we can determine whether we have been requesting the same ill-fated address all along, or the error is new for this address and must go into the report);
— a flag showing that the user has interrupted the script.
2. In the second part we install handlers for two events: termination of the script for any reason (here we close all the files and call the function that draws the user's attention to important events with a sound) and interruption of the program by the user (invoked by pressing Ctrl+C).
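The two handlers can be sketched as follows; this is my minimal reconstruction of the idea, and the real script also closes its three output files in the exit handler:

```javascript
let interrupted = false;

process.on('exit', () => {
  process.stdout.write('\x07'); // terminal bell to attract the user's attention
});

process.on('SIGINT', () => {
  interrupted = true; // checked in the request loop for a graceful stop
  console.log('Interrupt requested; finishing the current request...');
});
```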
3. In the third part we set up the request loop that will retrieve and save the list of entry addresses. The part is divided into two logical branches.
a. If the report file of processed pages is empty, the script is starting from scratch, not after a crash or a user interruption. In this case we print the first letter of the alphabet to the console window and its title, extract that letter from the alphabet array, and call the function that fetches the page built from the URL template.
b. If the file is not empty, the script has already been run. We need to extract the last processed address from the file in order to request the next one in line. Since the report file can be large, we do not load it into memory whole but use a module that reads the file line by line (ignoring blank lines, just in case). On reaching the end of the file, we have the needed address in a variable. Parsing this address, we learn which letter of the alphabet the script was last processing and the address of the list page following the one saved before the program exited. Based on this, we trim the alphabet array up to and including that letter, print the resumption point to the console, and call the function that fetches the next page from the template, taking the needed letter and page into account.
This ends the procedural part of the script. It is followed by three functions: one utility function and two main ones that call each other in turn in the chain of requests.
4. The sound signal is produced by the utility function called from the termination handler.
5. The first main function requests a page from the server.
Two additional features are commented out in the code.
a. If you plan to run several scripts in parallel to speed up downloading, it is better to do so through an anonymizing proxy server. I tested the combination Fiddler + Tor (the non-browser Expert Bundle version), though I did not use it for the whole run: it noticeably slows each process's communication with the server, and I did not want to complicate the work by dividing the task into parts for parallel processes. An example implementation can be found here.
If you still want to parallelize the script, you will either need to specify a different folder for the output files at startup or run the different copies of the script from different folders. Each of these folders should contain a report file on processed pages consisting of at least one line pointing to the address immediately preceding that copy's portion of addresses.
b. Another precaution against a server-side ban is a delay between requests. It is enough to wrap the request call in setTimeout and experiment with the size of the pause. My experience showed that for the Urban Dictionary servers the natural pauses between requests are enough, and no extra delays need to be inserted.
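The setTimeout wrapping can be done generically; this wrapper is my own illustration, and the delay value is something to tune by trial:

```javascript
// Return a delayed version of any function, e.g. of the page-request function.
function withDelay(fn, delayMs = 1000) {
  return (...args) => setTimeout(() => fn(...args), delayMs);
}

// Hypothetical usage: const getPageSlowly = withDelay(getPage, 500);
```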
6. The second main function handles the fetched page.
First, the function checks its error argument, comparing the current address with the previous one.
If there is no error, one of the following turns of events is possible.
a. The page contains a batch of links to dictionary articles. Then the function writes the list of addresses of these links to the dictionary contents file, writes the address of the current page to the file of processed pages, reports the number of saved links to the console, and tries to find the URL of the next list page. If the search succeeds, the program prints information about the following request (letter and page number) to the console and passes it to the request function we already know.
b. If there are no links on the page but its address matches the requested one, the error most likely occurred on the server (this happens, for example, when the server reports temporary unavailability). In this case the script retries the request.
c. If there are no links and the address does not match the requested one, a redirect has occurred. This is possible because of a peculiarity of Urban Dictionary's lists of entry addresses: sometimes the expected number of list pages for the current letter is higher than the real number, and when a nonexistent page number at the end of a letter's block is requested, the server forwards the user to the main page. In this case the script moves on to the next letter, if the alphabet array is not empty.
d. If the array is empty, the script exits.
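The four outcomes can be condensed into a pure decision function so the control flow is easy to see; the names here are mine, not the script's:

```javascript
function nextAction({ linkCount, finalUrl, requestedUrl, lettersLeft }) {
  if (linkCount > 0) return 'save-and-continue';  // case a: links found
  if (finalUrl === requestedUrl) return 'retry';  // case b: server hiccup
  if (lettersLeft > 0) return 'next-letter';      // case c: redirect past the last page
  return 'exit';                                  // case d: alphabet exhausted
}

console.log(nextAction({
  linkCount: 0,
  finalUrl: '/',
  requestedUrl: '/browse.php?character=A&page=99',
  lettersLeft: 26
})); // → 'next-letter'
```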
The result is a file with the dictionary's table of contents. The other two files are auxiliary, so you can delete them after reviewing, as needed, the errors that occurred.
II. Main stage: saving the dictionary articles
The structure of the second script is similar; the differences largely follow from the considerable growth both of the running time (it will now be measured in hours and days) and of the complexity of page processing:
UD.get_dic.js
1. In the first part we again load almost the same modules, then check the two key startup arguments: the first sets the path to the folder with the input file (where the script will look for the list of links to dictionary articles saved at the previous step), the second the path to the folder for the new output files. In both cases, if the argument is not set, the script's own folder is used. The script then checks that the file with the links exists; if it does not, the program exits with an appropriate announcement.
Next we define a few variables:
— a regular expression for formatting large numbers, and the number of milliseconds in an hour (both will be used regularly later);
— containers for permanent or temporary data storage (the list of links to dictionary articles, the list of headwords of the current article, and the list of its sections: the different user definitions of the headword);
— the familiar variables for the previous and the forthcoming request;
— variables for calculating and displaying the speed of the script;
— the user-interruption flag.
2. The second part introduces the event handlers described above: for completion of the script and for the user's command to interrupt the job.
3. In the third part we check the size of the main output file to see whether the program is starting fresh or after an interruption. If the work is just beginning, we write the BOM and the initial DSL format directives to the future dictionary file.
4. The fourth part concludes the procedural section of the program. Here we first read the input file with the list of links to dictionary articles into the container that will determine future requests (like the alphabet container in the first script, it will be the driver of our request cycle). Then, as in the previous script, we check the report file of already processed addresses: if there is something in it, we find its final line, which holds the last successfully saved article; trim the array of addresses for future processing accordingly; memorize the amount of remaining work; and start the function that once an hour will calculate the script's speed and roughly predict the completion time (in the hypothetical case of uninterrupted operation). Then we print the number of remaining addresses to the console (it is a big number, so we separate its digit groups with spaces for readability) and start our usual cycle of requesting and saving pages. If the report file is empty, we skip reading it and turn immediately to the second alternative:
if the incoming file of dictionary entry links is empty too, we inform the user and exit the program till better times.
This is followed by the functions: several small utility ones and the two main components of the familiar request cycle.
5.–8. The small utility functions and the page-request function largely parallel those of the first script.
9. The page-handling function, by contrast, has grown noticeably.
However, the beginning of the function is unchanged: we still check the error argument first.
If there are no errors, we begin to analyze the page. The following turns of events are possible.
a. The page contains the expected dictionary entry, and its address gives no reason to suspect a redirect.
In this case we gather into an array all the parts of the dictionary entry, that is, all the user definitions of the word, and proceed to analyze each element.
Each user definition usually consists of three main sections: the headword (it may be the main title of the article or a variant of it with minor deviations), the definition, and examples (the latter part is optional).
We collect all the headwords in a special buffer (to avoid adding duplicates, we use the comparatively new JavaScript data structure Set).
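The Set-based deduplication of headwords, shown in isolation:

```javascript
const headwords = new Set();

// Exact repeats are silently ignored; Set keys are case-sensitive.
['lol', 'LOL', 'lol'].forEach(h => headwords.add(h));
console.log(headwords.size); // → 2
```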
A note is in order on extracting text from the DOM elements with the module we use.
We then sequentially process the definition and example parts before saving them in temporary variables: we remove whitespace at the beginnings and ends of lines, collapse multiple spaces, escape the required characters, insert the initial indentation required for the card body, and insure against the loss of empty separator lines in the future DSL compilation.
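The clean-up chain can be sketched as one function. The set of escaped characters here is my reading of the DSL format, not the author's exact code:

```javascript
function cleanForDsl(text) {
  return text
    .replace(/^[ \t]+|[ \t]+$/gm, '')    // trim each line
    .replace(/ {2,}/g, ' ')              // collapse runs of spaces
    .replace(/([\[\]{}@^~])/g, '\\$1');  // escape DSL service characters
}

console.log(cleanForDsl('  a  word  [slang]  ')); // → 'a word \[slang\]'
```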
When finished with the main parts, we save the service information in variables: the votes for and against each definition and the creation time of each definition (trimming the excess part that appears in anonymous submissions).
At the end we combine all the parts into the next element of the buffer that accumulates the parts of the dictionary entry.
Then we check whether the article has several pages. If it does, we request the next page in order to repeat the analysis and grow our buffers of headwords and definitions. If the article has one page (or this is the last page of a multi-page article), we write both buffers to the dictionary file, record the address in the report file of successfully processed addresses, print the number of stored definitions and the speed of the program, and clear the buffers before the next round of the cycle.
Then, if the array of addresses is not empty, we request the next dictionary entry. Otherwise the program exits.
b. The expected article is absent, but the address does not suggest forwarding. There can be at least two reasons for this.
— The article's title remained on the server, or got onto it, by mistake: the definition has been removed or was never created. In this case the server substitutes the definition of a certain emoticon. If the program finds it, it enters a corresponding remark in the error file, displays a message in the console, and proceeds to the next address (or exits if the array of addresses is empty).
— Some other error occurred on the server. In this case the script repeats the request to the same address.
c. The page address differs from the expected pattern: a redirect has occurred.
In this case the program reports an error that requires the user's intervention (you need to understand the reason for the forwarding: you have been banned by the server, the site has changed, or something else has happened that previous experience did not cover) and exits.
If all goes well, in the end we get a ready dictionary in the DSL format. The service and intermediate files can then be removed (after analyzing the error file, if you did not do so while the script was running).
III. Associated tasks
In this part I will give examples of other tasks that you may encounter when creating and processing digital dictionaries, as well as when converting them to different formats.
1. Changing the file encoding
The ABBYY Lingvo compiler requires the UTF-16 encoding, but GoldenDict can also work with UTF-8 (which, for dictionaries in mostly Latin script, gives half the file size). The differing speed of some text editors when processing large files in these two encodings can also be a reason for conversion. Of course, you can re-save a file in a different encoding from an editor, but that is not always the fastest and most convenient way.
Node.js provides simple ways to convert large files quickly and with modest memory use. Here are two examples for these encodings (a label with the encoding is added to the new file's name before the extension):
utf8_2_utf16.js
utf16_2_utf8.js
2. Search and replace
As in the previous case, for large files this operation is easier and cheaper to perform with scripts than with text editors. Scripts are particularly valuable when the replacements must depend on various conditions.
Here are three examples of increasing complexity.
a. A simple replacement
This script appeared in response to a user's request to make the example sections of Urban Dictionary collapsible, that is, to wrap them in secondary-display tags. Writing the script and creating the requested version of the dictionary took a few minutes.
replace.js
For other simple cases it is enough to change the lines with the regular expressions in this file.
b. Removing the BOM
The need for this action arose when I noticed that a PDF printer converts the BOM into a space when creating the PDF file. There are also other cases where the BOM confuses programs (especially in the UTF-8 encoding).
deBOM.js
c. Replacements depending on conditions
The following script is an example of replacements made dependent on various conditions.
Different replacements are triggered depending on whether a line belongs to a DSL directive, an article headword, or a card body.
— Directives with a certain keyword are removed;
— the words of the headwords are converted to lower case if they are not in the exception array. Such a need can arise if all the characters of a dictionary's headwords are given in upper case. The array can be filled with proper names and abbreviations; you can keep it in a separate file and load it at the beginning of the script;
— in the tags that set the indentation of the card body, the indent number is incremented.
You can use this script as a template for other replacements: just comment out the unused rules and edit the necessary ones.
replace_in_dsl.js
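One of the rules above, lowercasing headwords with an exception list, can be sketched like this; the exception entries are illustrative, and the real list would hold the dictionary's proper names and abbreviations:

```javascript
const EXCEPTIONS = new Set(['NASA', 'TV', 'I']); // illustrative entries

function lowerHeadword(headword) {
  return headword.split(' ')
    .map(w => EXCEPTIONS.has(w) ? w : w.toLowerCase())
    .join(' ');
}

console.log(lowerHeadword('NASA SPACE PROGRAM')); // → 'NASA space program'
```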
In all the examples of this subsection, the encodings of the input and output files are specified in the code (examples for other possible encodings are commented out for easy substitution). But with a slight change to the code you can set the encoding in a startup argument of the script, and you can add two arguments if the encodings of the old and the new file differ.
3. Counting the elements of a dictionary
With the following script I counted the number of elements in the finished Urban Dictionary: the number of headwords, cards, definitions within all the cards, and lines in the file.
count_dsl_elements.js
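A counting sketch in the spirit of count_dsl_elements.js. In DSL, headwords sit at column 0, card bodies are indented, and several consecutive headwords share one card; this simplified classifier is my reading of the format, not the author's exact rules:

```javascript
function countDsl(lines) {
  let headwords = 0, cards = 0, prevWasHeadword = false;
  for (const line of lines) {
    if (line.trim() === '' || line.startsWith('#')) { // blank line or directive
      prevWasHeadword = false;
      continue;
    }
    const isHeadword = !/^[ \t]/.test(line);
    if (isHeadword) {
      headwords++;
      if (!prevWasHeadword) cards++; // first headword of a new card
    }
    prevWasHeadword = isHeadword;
  }
  return { headwords, cards, lines: lines.length };
}
```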
4. Extracting the headwords
The following script extracts all the headwords from a DSL dictionary (for example, to build a list of a language's words from the largest dictionaries, or to build the exception list for the script, described above, that converts headwords to lower case).
extract_headwords.js
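The extraction rule, sketched on an array of lines (the real script streams the file, but the classification is the same idea): in DSL, headwords are the non-empty lines that are neither directives nor indented card-body lines.

```javascript
function extractHeadwords(lines) {
  return lines.filter(l =>
    l.trim() !== '' &&     // skip blank separators
    !l.startsWith('#') &&  // skip DSL directives
    !/^[ \t]/.test(l));    // indented lines belong to card bodies
}
```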
5. Checking the uniqueness of the headwords
For large dictionaries it is better to check the uniqueness of the headwords before compiling, so as to correct errors in advance and not repeat the long compilation process several times. The following script prints the results of the check to the console and, if there are duplicates, lists them in a file.
check_headword_uniqueness.js
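A duplicate check sketched with a Map of occurrence counts, so the duplicates can be listed with their counts (here returned as an array; the script writes them to a file):

```javascript
function findDuplicates(headwords) {
  const seen = new Map();
  for (const h of headwords) seen.set(h, (seen.get(h) || 0) + 1);
  return [...seen]
    .filter(([, n]) => n > 1)
    .map(([h, n]) => h + '\t' + n);
}

console.log(findDuplicates(['a', 'b', 'a', 'c', 'b', 'a'])); // → ['a\t3', 'b\t2']
```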
6. Converting DSL to TXT
A readable text version of the dictionary, besides its standalone value, can serve as the basis for conversion to the PDF and DjVu formats. Hence the features of the chosen structure of the text file, in particular its rigid formatting of the fields with whitespace and its explicit line breaks at paragraph boundaries.
dsl_2_txt.js
The beginning of the script is traditional. Two new modules are added:
— string-width checks whether a character occupies the width of two characters in a string (as a rule, Asian characters of the CJK group). It is needed for the correct calculation of line length when splitting a paragraph into parts of a given size (the check is performed by a small utility function at the end of the file);
— wordwrap splits a paragraph into lines of a given width, with an optional uniform indent from the left edge.
Then follows a series of internal variables responsible for improving the readability of the text: splitting paragraphs into substrings, adding the basic and additional indents, deleting the DSL tags, removing the escaping, and determining the boundaries of the cards.
Then, after registering the already familiar shutdown handler, the script looks for the dictionary's annotation file in the folder and writes it to the generated text file (a check for the annotation's presence could be inserted, but as a rule it is there).
Finally, the script reads the DSL file line by line, carefully watching the changes of cards (the transition from the end of one card to the first headword of the next) and resetting, when necessary, the indent width, and performs the required actions depending on the object of analysis (a headword or the card body):
— it removes the escaping of special characters, the DSL tags, and the "fuses" against the loss of blank lines;
— it determines whether indentation needs to be added (if [mN] tags are encountered), checks for "wide" characters (in such cases I simply halved the allowable line width, but finer tuning depending on the number of such characters could be added), sets the current line width and the need for hard breaks within words (if a word exceeds the allotted line), and formats the paragraph accordingly;
— it writes the formatted paragraph to the text file.
Examples of the resulting text file can be seen in the screenshots of any of the distributions mentioned above.
7. Paginating a text file
Sometimes the program from which you have to print to PDF has no option for automatic placement of page numbers. For such cases a pagination script for the text file can be useful.
paginate.js
The script takes two mandatory arguments: the path to the file and the number of lines per page (which it reduces by two, to leave room for the page number after a blank line).
For easy formatting of the numbers we load the string module, which provides many functions for working with string data. In particular, we need a function that pads a string with spaces to a given width, centering it, and a function that removes the right-hand part of this padding after the insertion. This lets us place the numbers in the middle of the bottom line of each page.
In the procedural part the script reads the incoming file line by line, inserts the formatted page numbers, and writes it all to the output file. At the end, the last number is inserted at the bottom of the final page.
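Centering a page number in a line of fixed width can also be done without the string module; this sketch with plain arithmetic shows the same effect:

```javascript
function centeredNumber(n, lineWidth) {
  const s = String(n);
  const left = Math.floor((lineWidth - s.length) / 2);
  return ' '.repeat(Math.max(0, left)) + s; // the right-hand padding is dropped
}

console.log(JSON.stringify(centeredNumber(7, 11))); // → "     7"
```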
8. Splitting a file into parts
Here are two examples of splitting that depend on different parameters.
a. Splitting by letters of the alphabet
Sometimes a large dictionary is easier to work with if you split it into alphabetical parts. Adobe Acrobat, for example, can treat all the parts as a single whole if you index the whole parent folder and then search all of its content through that index.
split_by_abc.js
Before starting, the script creates one more report file in which it will save information about the sizes of the parts (the number of lines in the file for each letter). This may be of theoretical (statistical) interest as well as practical interest (the processing time of the parts). However, this is an optional complication.
Then it creates an array with the alphabet (at the end of which is placed the symbol that will begin the title of the last part of the dictionary, the one containing headwords that start with special characters) and a few auxiliary variables.
The procedural part begins with a call to the function that opens the file of the next part. On each call this function:
— closes the previous file (if this is not the beginning of the work);
— records its number of lines in the report file;
— resets the line counter;
— creates a new file named with the next letter of the alphabet (making sure that the character chosen for the last part of the dictionary is not among those prohibited in file names);
— writes the BOM at the beginning of the new part (into the first part it is carried over automatically from the dictionary);
— prints the upcoming letter of the alphabet to the console and to the report file;
— if the alphabet array is not empty, forms the regular expression against which the lines of the input file will be checked in order to catch the transition to the next alphabetical part.
After the file for the first part is created, a cycle of reading the whole dictionary follows: each line is checked against the mentioned regular expression and, when necessary, a new file is created; the text of the current part is written to the current file, and the line count for the report file grows. At the end of the cycle the size of the last part is recorded and all the files are closed.
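The per-letter transition test can be sketched as follows: a headword line beginning with the next letter (in either case) triggers a new output file. The function name is mine:

```javascript
function startsNewPart(line, nextLetter) {
  const re = new RegExp('^[' + nextLetter.toUpperCase() + nextLetter.toLowerCase() + ']');
  return re.test(line);
}

console.log(startsNewPart('banana', 'B')); // → true
console.log(startsNewPart('apple', 'B'));  // → false
```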
This script can also be used to split huge DSL files that are too much for the ABBYY compiler (for example, in recent versions of Lingvo, an attempt to compile the voluminous Russian-English directions of the "Multitran" dictionaries crashes the compiler with an "out of memory" message). In such a case it may be enough to split the dictionary into two parts at a middle letter of the alphabet (so the alphabet array mentioned above can be reduced to a single letter).
b. Splitting by the number of pages
When I tried to print an entire dictionary to PDF from a text editor, I ran into a limitation on large files: the editor would send at most 65,000 pages to the printer. I had to divide the dictionary into chunks of 65,000 pages, print them separately, and then combine the parts in Adobe Acrobat. For this the following script was written.
split_by_pages.js
The script is similar to the previous one, with a few exceptions:
— it takes three mandatory arguments when running: the path to the dictionary, the number of lines per page, and the number of pages in one part of the split dictionary;
— no report file is created, because the sizes of the parts will be roughly the same;
— the function that creates the new parts is simplified (there is no writing to a report file and no creation of a checking regular expression); instead of letters of the alphabet, the name of each part includes the number of its first page in the overall numbering;
— the read/write cycle creates new files based not on the first letters of the headwords of a dictionary part but on the counts of lines and pages.
That is about all I can share with the creators and editors of electronic dictionaries who wish to try out the possibilities of Node.js. Thanks for reading!
Article based on information from habrahabr.ru
But by itself ABBYY would not have achieved such success without the help of a large army of enthusiasts, lexicographers, manic year after year ocifrovivaem paper dictionaries and digital dictionaries convertirovali — from miniature to huge special General purpose.
One of the most famous and fruitful group has long been working on the website forum.ru-board.com. Over time there has accumulated a vast collection of dictionaries and osnovatelnee knowledge base and tools to support their creators and editors. I have written a lot of scripts and programs, which reflects the history and changes of popularity of programming languages more or less suitable for text processing. There are Perl and Python, and languages batch files for shells, and macros MS Word and Excel, and compiled program on the General-purpose languages.
However, until recently, one of the languages has hardly been in this area. I would like to fill this gap and to pay tribute to the rapid growth of capacity, functionality, and popularity of the JavaScript language. I think it can be of great help to modern programmers, lexicographers, especially on the border of the network and local lexicography.
Create local copies of network dictionary is usually several steps: saving HTML pages with the help of programs like Teleport, cleaning them of tags using regular expressions (in text editors, macros or scripts), the final markup their DSL. JavaScript in it Node.js option can significantly reduce and ease this way, because this language is native for the WEB and knows how to operate on network data, not sinking to the precarious and changeable level code and regular expressions, but working at the level DOM.
I will try to illustrate the capabilities of the language and some of its libraries by the example of creating a local copy of one of the richest and most popular English explanatory dictionaries born in the network: Urban Dictionary. The fruits of recent efforts to estimate these distributions for popular trackers:
rutracker.org/forum/viewtopic.php?t=5106848
nnm-club.me/forum/viewtopic.php?t=951668
kinozal.tv/details.php?id=1389116
If you're not planning to maintain some kind of network dictionary, you can look from the third part of the article: it contains examples of other common tasks when working with electronic dictionaries that can be solved with the help of Node.js.
However, it should be noted that programming for me is just a hobby. This is both a warning about unprofessional further examples and encouragement for those who, like me, has only a liberal arts education.
Assumes that the reader knows JavaScript in its pure and applied options and understand the basics Node.js. If not, you will have to start from scratch or to fill in the blanks: JavaScript, DOM and Node.js.
In this article we will limit ourselves to the processing of static pages (by this we mean page, the key content is not changed when you disable JavaScript) console scripts. In subsequent parts we will examine the preservation of dynamic websites (content key which is generated by the scripts) and highlight programs with a GUI.
Because we will run scripts only on their computers and use the new language features, we recommend the latest version Node.js.
I. Preliminary stage: fetching the list of entry addresses
At least three algorithms for creating a local copy of a network dictionary are possible.
1. In the worst case the dictionary provides no reliable mechanism for iterating over all its articles. Then we have to analyze the address pattern and substitute into it, one by one, the words of some more or less complete list of the language's lexemes (a word list can be borrowed from the largest digitized dictionary), discarding failed requests.
2. Some dictionaries allow you to walk the chain from the first headword to the last (via a "next word" link or a set of links to the following headwords). This is the easiest way, but not the most convenient: it is difficult to assess the total number of words in advance and then to monitor the copying progress. Therefore, although Urban Dictionary provides this functionality (on each word's page there is a column of links to the nearest preceding and following articles), we will use the third way.
3. If the dictionary has a separate list of links to all entries, we copy the whole set of these links to a file. This gives us an idea of the upcoming volume of requests and the ability to monitor the percentage done. For example, in Urban Dictionary, requests of the form www.urbandictionary.com/browse.php?character=A, www.urbandictionary.com/browse.php?character=A&page=2, etc. return the lists of addresses of all articles on the given letter (such as www.urbandictionary.com/define.php?term=a, www.urbandictionary.com/define.php?term=a%5E_%5E, etc.).
So the whole process of saving the dictionary is divided into two phases, each handled by a separate script.
Here is the code of the first one, which saves the list of links to dictionary articles:
UD.get_toc.js
1. In the initial part of the script we load the required library modules (or the frequently used methods from them). Almost all of the modules are built in and installed along with Node.js. From outside we need only the module jsdom: Node.js itself is not able to analyze HTML pages and turn them into a DOM tree, and this module provides that ability. Installing modules is simple, because the package manager npm is installed together with Node.js: just open a console, navigate to the folder with the script and type

npm install jsdom

then wait while the module and its dependencies are downloaded and installed into the folder node_modules, where our script will look for them.

After loading the modules, the script determines the folder to save the files to (if the user did not specify one in the first command-line argument when launching the script, the folder containing the script is chosen) and creates three future documents: the list of entry addresses; the list of processed pages from which those addresses were taken; and a report on the errors that occurred.
At the end of the first part, four service variables are created, which will store:
— an array of the English alphabet (for adding letters one by one when building the URLs of the link lists; the symbol *, responsible for the list of links to headwords beginning with special characters, is appended to the end of the array);
— the previous and the current request URL (so that on an error we can determine whether we have been requesting the same ill-fated address all along, or the error is new for this address and must be included in the report);
— a flag of user interruption of the script.
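The service variables described above might be set up like this (a sketch; the exact names in UD.get_toc.js may differ, and the URL scheme is assumed):

```javascript
// Alphabet array for iterating over the browse pages; '*' covers
// headwords that start with special characters.
const alphabet = [...'ABCDEFGHIJKLMNOPQRSTUVWXYZ', '*'];

// Build the URL of a given page of the address list for a letter.
// Page 1 carries no "page" parameter on Urban Dictionary.
function browseURL(letter, page = 1) {
  const base = 'https://www.urbandictionary.com/browse.php?character='
    + encodeURIComponent(letter);
  return page > 1 ? base + '&page=' + page : base;
}

// Variables for error handling and user interruption.
let prevURL = '';        // last requested address
let nextURL = '';        // address about to be requested
let interrupted = false; // set to true on Ctrl+C
```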
2. In the second part we install handlers for two events: for the script finishing in any way (there we close all the files and call the function that uses sound to attract the user's attention to any important event) and for the user interrupting the program (invoked by pressing Ctrl+C; it switches the interruption flag, which is checked before each new network request).
3. In the third part we start the cycle of requests that will retrieve and save the list of entry addresses. This part is divided into two logical sections.
a. If the report file of processed pages is empty, the script is starting from the beginning, not after a crash or a user interruption. In this case we print the first letter of the alphabet to the console window and its title, extract that letter from the alphabet array, and call the function that requests the page built on the URL template.
b. If the file is not empty, the script has already been working. We need to extract the last processed address from the file in order to request the next one in line. Because the report file can be large, we will not load it into memory entirely, but use a module to read the file line by line (ignoring blank lines, just in case). On reaching the end, we will have the needed address in a variable. By analyzing this address we obtain the letter of the alphabet the script was last processing, and the address of the page of the list that follows the one saved before the program exited. Based on these data we shorten the alphabet array up to and including that letter, print the resumption of processing to the console, and call the function that requests the next page built on the template with the needed letter and page number.
This concludes the procedural part of the script. It is followed by three functions: one utility function and two main ones that call each other in turn over the series of requests.
4. For the audible alert in the utility function playAlert() I chose a cross-platform console player from the FFmpeg suite (see the developer's website for the launch keys), but you can use any other player, or any module that generates sound by system means. The sound itself can also be any you like.
5. The function getDoc(url) sends the request for the next page of the entry address list. First it checks whether the user has demanded to stop the script (the script runs for several hours, so a break may be needed). Then it updates the variables of the past and the upcoming request. Finally, it instructs the jsdom module to request the page, passing to the corresponding method the function that is to be called on receiving the page.
Two additional features are commented out in the code.
a. If you plan to run several scripts in parallel to speed up downloading, it is better to do so through an anonymizing proxy server. I tested the combination Fiddler + Tor (the non-browser Expert Bundle version), though I did not use it for the whole script, since it simultaneously slows down the communication with the server in a single process, and I did not want to complicate the work by dividing the task into parts for parallel processes. For an example implementation, look here.
If you still want to parallelize your script, you will need either to specify a different output folder at startup for each copy, or to run the copies from different folders. Each of these folders should contain a report file on processed pages consisting of at least one line with the address immediately preceding the assigned portion of addresses.
b. Another precaution against a server-side ban is a delay between requests. It is enough to wrap a method call in setTimeout and experiment with the size of the pauses. My experience has shown that for the Urban Dictionary servers the natural pauses between requests are enough, and no extra breaks need to be inserted.
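The setTimeout wrapper mentioned here amounts to very little code (the default delay below is an assumption to experiment with):

```javascript
// A precaution against a server-side ban: wrap the next request in a
// pause. DELAY_MS is an assumed starting value, not a recommendation.
const DELAY_MS = 1000;

function scheduleRequest(url, getDoc, delay = DELAY_MS) {
  setTimeout(() => getDoc(url), delay);
}
```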
6. The function processDoc(err, window) is called by the jsdom module once it has received a page or stumbled on an error — hence the two corresponding arguments.
First the function checks the err argument: if it is defined, the request failed. In this case the script gives an audible signal, writes a message to the error file (if this is the first error for that URL, rather than the next in a chain of repeated requests), displays the information in the console window and its title, and restarts the request by calling getDoc(url) with the same argument.
If err is empty, the function starts parsing the received document. Several outcomes and reactions to them are possible.
a. The page contains a batch of links to dictionary articles. Then the function writes the addresses of these links to the dictionary contents file, writes the address of the current page to the file of processed pages, reports the number of stored links to the console, and tries to find the URL of the next page of addresses. If the search is successful, the program prints to the console the information about the following request (letter and page number) and passes it to the familiar getDoc(url). If the search fails, the program checks the alphabet array: if letters remain, it moves on to the next one; if the array is empty, it shuts down.
b. If there are no links on the page, but its address matches the requested one, then most likely an error occurred on the server (this happens, for example, when the server answers that it is temporarily unavailable). In this case the script retries the request.
c. If there are no links and the address does not match the requested one, a redirect has occurred. This is possible owing to one peculiarity of the entry address list in Urban Dictionary: sometimes the expected number of pages in the list for the current letter is higher than the real number, and on an attempt to request a nonexistent page number at the end of a letter block the server forwards the user to the main page. In this case the script moves on to the next letter if the alphabet array is not empty.
d. If the array is empty, the script shuts down.
The result is a file with the table of contents of the dictionary. The other two files have service value: you can delete them, or study them as needed along with the errors that occurred.
II. Main stage: the texts of dictionary entries
The structure of the second script is similar; the differences are largely the result of the significant growth both of the run time (it will now be measured in hours and days) and of the complexity of page processing:
UD.get_dic.js
1. In the first part we again load almost the same modules, then check the two launch keys of the program: the first sets the path to the folder with the input file (where the script will look for the list of links to dictionary articles saved at the previous stage), the second the path to the folder for the new output files. In both cases, if the keys are not set, the script's own folder is used. The script checks for the presence of the file with the links — if it is missing, the program exits with an appropriate announcement.
Next we define a few variables:
— a regular expression for formatting large numbers, and the number of milliseconds in an hour — both will be used regularly later on;
— containers for permanent or temporary data storage (the list of links to dictionary articles, the list of headings of the current article, and the list of its sections: the different user interpretations of the headword);
— the familiar variables for the previous and the forthcoming request;
— variables to calculate and display the speed of the script;
— the flag of user interruption.
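The first two of these variables can be sketched as follows (the names are assumptions; the regular expression splits the digits of a large count into groups of three, as used later when printing the number of remaining addresses):

```javascript
// Insert a space between digit groups of three, counting from the right:
// 1234567 -> '1 234 567'. \B avoids a leading space before the first group.
const GROUP_RE = /\B(?=(\d{3})+(?!\d))/g;

function formatNumber(n) {
  return String(n).replace(GROUP_RE, ' ');
}

// Milliseconds in an hour, for the hourly speed recalculation.
const HOUR_MS = 60 * 60 * 1000;
```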
2. The second part introduces the event handlers described above: for the completion of the script and for the user's command to interrupt the job.
3. In the third part we check the size of the main output file to find out whether the program is starting for the first time or resuming after a break. If the work is just beginning, we write the BOM and the initial directives of the DSL format into the future dictionary file.
4. The fourth part concludes the procedural section of the program. In it we first read our input file with the list of links to dictionary articles into the container that will determine the future requests (like the alphabet container in the first script, it will be the driving force of our request cycle). Then, as in the previous script, we check the report file of already processed addresses: if there is something there, we find the final line, which names the last successfully saved article, shorten accordingly the array of addresses awaiting processing, memorize the amount of remaining work, and launch the function that every hour will calculate the speed of the script and approximately predict the completion time (in the hypothetical case of uninterrupted work). Then we print to the console the number of remaining addresses (it is a big number, so we split its digits into groups with spaces for better readability) and start our usual cycle of requesting and saving pages. If the report file is empty, we skip reading it and proceed straight to the following check:
If the incoming file with the links to dictionary entries is also empty, we inform the user and exit the program until better times.
This is followed by the functions: several small service ones and the two main ones constituting the familiar alternation of cyclic requests.
5. The function playAlert() is no different from its namesake in the first script.
6. The function secure(str, isHeadword) will be used regularly when saving dictionary entries to the DSL file. It has two tasks: to translate the control characters (characters from the initial ASCII block) that, oddly enough, occur in network texts into an arbitrary readable form that will not confuse the DSL compiler; and to cut words of the articles that are too long down to the boundary required by the DSL format (headings are shortened by other rules).
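A minimal sketch of what secure(str, isHeadword) might do; the caret notation for control characters and the length limit MAX_LEN are assumptions, not the exact values of the original script:

```javascript
// Replace ASCII control characters (which occasionally occur in network
// text) with a readable caret notation, and truncate over-long body
// words so they do not break the DSL compiler. MAX_LEN is an assumption.
const MAX_LEN = 127;

function secure(str, isHeadword) {
  let out = str.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g,
    ch => '^' + String.fromCharCode(ch.charCodeAt(0) + 64));
  // Headings are shortened by other rules, so only body text is cut here.
  if (!isHeadword && out.length > MAX_LEN) {
    out = out.slice(0, MAX_LEN - 1) + '…';
  }
  return out;
}
```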
7. The function setSpeedInfo() runs in parallel with the main course of the program. Every hour it replaces the information line that displays the speed of the script and the remaining time (at the beginning of the work the line holds question marks, which after the first hour are replaced by numbers). The function is quite transparent; only two remarks are necessary: the variable restMark stores the number of remaining addresses at the time of the previous speed computation; and the chime accompanying each recalculation is launched from this function asynchronously (that is, the script does not wait for the sound to end) — for this we saved beforehand, in a separate variable, the method for asynchronous launching of child processes.
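The hourly speed and remaining-time arithmetic inside setSpeedInfo() reduces to something like this (the function and field names are assumptions):

```javascript
// Given the counts of remaining addresses at the previous and the current
// hourly ticks, return the speed (addresses per hour) and a rough ETA.
function speedInfo(restMark, restNow) {
  const perHour = restMark - restNow;          // processed in the last hour
  const hoursLeft = perHour > 0 ? restNow / perHour : Infinity;
  return { perHour, hoursLeft };
}
```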
8. The function getDoc(url) that sends the network request is no different from the one described in the previous section, including the commented-out precautions against a server ban and the means of speeding up the work.
9. The function processDoc(err, window), compared with its counterpart in the previous script, keeps the same frame, but differs significantly in the processing and storing of the information received from a page — because we will not simply record sets of links, but analyze and transform whole blocks of data.
However, the beginning of the function has not changed: we still check the err argument and, if it is defined, write the information to the error report file and restart the failed request.
If there are no errors, we begin to analyze the page. The following turns of events are possible.
a. The page contains the expected dictionary entry, and its address gives no reason to suspect a redirect.
In this case we turn all the parts of the dictionary entry — that is, all the user interpretations of the word — into an array, then proceed to analyze each element.
Each user interpretation usually consists of three main sections: the heading (it may repeat the main title of the article or be a variant of it with minor deviations), the interpretation, and the examples (the last part is optional).
All the headings are collected in a special buffer; to avoid adding duplicates we use the JavaScript data structure Set, which retains only unique elements. Before that, each heading is run through the function secure(str, isHeadword), and then two variants are created: one for the headword section of the DSL and one for the heading at the start of the card body — these areas have different requirements. In each variant the necessary characters are escaped; the first variant, before being placed in the buffer, is shortened according to the format requirements if it is too long.
Since the textContent property that the jsdom module offers for extracting text from DOM elements has several shortcomings, we additionally insure ourselves against the loss of line breaks by inserting newline characters in front of certain br elements.
We then sequentially process the interpretation and example parts before saving them in temporary variables: we trim whitespace at the beginning and end of lines, collapse multiple spaces, escape the required characters, insert the initial indentation required for the card body, and guard against the loss of empty separating lines in the future compilation of the DSL.
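Two of the operations just described — collecting unique headings with a Set and escaping characters for the card body — can be sketched as follows (the exact set of characters to escape is an assumption based on the DSL tag syntax):

```javascript
// Collect heading variants, keeping only unique ones.
const headwords = new Set();

function addHeadword(raw) {
  const h = raw.trim();
  if (h !== '') headwords.add(h);   // duplicates are silently ignored
}

// Escape characters that have special meaning in a DSL card body.
function escapeDSL(str) {
  return str.replace(/[\\\[\]{}@]/g, ch => '\\' + ch);
}
```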
When finished with the main parts, we save the service information in variables: the votes for and against each interpretation, and the creation time of each interpretation (trimming the excess part that appears in anonymous submissions).
At the end we combine all the parts into the next element of the buffer that accumulates the parts of the dictionary entry.
Then we check whether the article takes up several pages. If yes, we request the next page to repeat the analysis and to grow our buffers of headings and interpretations. If the article fits on one page (or if this is the last page of a multi-page article), we write both buffers to the dictionary file, record the address in the report file of successfully processed addresses, print the information about the number of stored interpretations and the speed of the program, and clear the buffers before the next round of the cycle.
Then, if the array of addresses is not empty, we request the following dictionary entry. Otherwise the program shuts down.
b. The expected article is missing, but the address does not indicate forwarding. There may be at least two reasons for this.
— The title of the article remained on the server or got into the list by mistake — the interpretation has been removed or was never created. In this case a particular emoticon is inserted in place of the interpretation. If the program finds it, it enters a corresponding remark into the error file, displays a message in the console and proceeds to the next address (or shuts down if the array of addresses is empty).
— An indeterminate error occurred on the server. In this case the script re-requests the same address.
c. The page address differs from the expected pattern — a redirect has occurred.
In this case the program reports an error that requires user intervention (one needs to understand the reason for the forwarding: you have been banned on the server, the site has changed, or something else has happened that previous experience did not record) and exits.
If all goes well, in the end we get a ready dictionary in DSL. The service and intermediate files can then be removed (after analyzing the error file, if you did not do this while the script was running).
III. Further operation with the received dictionary
In this part I will give examples of other tasks that you might encounter when creating and processing digital dictionaries, as well as when converting them into different formats.
1. Change the encoding
The ABBYY Lingvo compiler requires the UTF-16 encoding, but GoldenDict can also work with UTF-8 (which, for dictionaries based on the Latin alphabet, halves the file size). The differing speed of some text editors in handling large files in these two encodings can also be a reason for conversion. Of course, you can re-save a file in a different encoding in an editor, but that is not always the fastest and most convenient way.
Node.js provides simple ways of converting large files that are fast and require little memory. Here are two examples for these encodings (an encoding label is added to the name of the new file before the extension):
utf8_2_utf16.js
utf16_2_utf8.js
2. Replace text
As in the previous case, for large files this operation is easier and cheaper to perform with scripts than with text editors. Scripts are especially valuable if the replacement has to depend on various conditions.
Here are three examples with increasing complexity.
a. Simple replacement
This script appeared in response to a user's request to make the example sections of Urban Dictionary collapsible — that is, to wrap them in the tags of secondary display. Writing the script and creating the requested version of the dictionary took a few minutes.
replace.js
For other simple cases it is enough to replace the lines with the regular expressions in this file.
b. Removing the BOM from the beginning of a file
The need for this arose when I noticed that a PDF printer converts the BOM into a space when creating the PDF file. There are also other cases when a BOM confuses a program (especially in the UTF-8 encoding).
deBOM.js
c. Conditional replacements
The following script is an example of replacements that depend on various conditions.
Different replacements are triggered depending on whether a line belongs to a DSL directive, an article heading or an article body.
— From the directive lines, the ones with the keyword #ICON_FILE are removed; they are sometimes inserted when decompiling LSD dictionaries (even when no icon is built into the dictionary).
— The heading words are translated to lower case if they are not in the exception array. Such a need may arise if all the characters of the dictionary's headings are given in upper case. The array can be filled with proper names and abbreviations; you can store it in a separate file and load it at the beginning of the script.
— In the tags that indent the body of the card, the indentation number is incremented ([m1] becomes [m2], etc.).
You can use this script as a template for other replacements — just comment out the unused rules and edit the necessary ones.
replace_in_dsl.js
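The per-line logic of such a script can be sketched like this (the exception list and the heading test are assumptions; a return value of null means "drop the line"):

```javascript
// Sketch of condition-dependent replacements on one DSL line.
// EXCEPTIONS would normally be loaded from a separate file.
const EXCEPTIONS = new Set(['NATO', 'DSL']);

function transformLine(line) {
  // Drop the #ICON_FILE directive left over from LSD decompilation.
  if (/^#ICON_FILE/.test(line)) return null;
  // Heading lines in DSL start without indentation.
  if (!/^\s/.test(line) && !line.startsWith('#')) {
    return EXCEPTIONS.has(line) ? line : line.toLowerCase();
  }
  // Body lines: increment the indent number in [mN] tags.
  return line.replace(/\[m(\d+)\]/g, (m, n) => '[m' + (Number(n) + 1) + ']');
}
```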
In all the examples of this subsection the encodings of the incoming and outgoing files are specified in the code (lines for other possible encodings are commented out, ready for substitution). But you can also set the encoding in a launch key of the script by slightly altering the code; add two keys if the encodings of the old and the new file differ.
3. Counting of elements
With the following script I counted the number of elements in the finished Urban Dictionary — namely, the number of headings, cards, interpretations inside all the cards, and lines in the file.
count_dsl_elements.js
4. Extracting elements
The following script extracts all the headings from a DSL dictionary (for example, to build a list of the language's words from the largest dictionaries, or to build the exception list for the script described above that transfers headings to lower case).
extract_headwords.js
5. Check uniqueness of elements
For large dictionaries it is better to check the uniqueness of the headings before compilation, so that errors are corrected in advance and the long compilation process is not repeated several times. The following script displays the validation results in the console and, if there are duplicates, lists them in a file.
check_headword_uniqueness.js
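The duplicate detection itself can be sketched with a Map that counts occurrences (the function name is an assumption):

```javascript
// Report duplicate headings: count each one with a Map and keep those
// that occur more than once.
function findDuplicates(headwords) {
  const seen = new Map();
  for (const h of headwords) seen.set(h, (seen.get(h) || 0) + 1);
  return [...seen]
    .filter(([, n]) => n > 1)
    .map(([headword, count]) => ({ headword, count }));
}
```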
6. Conversion of DSL to a text format
A readable text version of the dictionary, besides its standalone value, can be the basis for conversion to the PDF and DjVu formats. Hence the features of the chosen structure of the text file, in particular its rigid formatting of the fields with whitespace and explicit line breaks at paragraph boundaries.
dsl_2_txt.js
The beginning of the script is traditional. Two new modules are added:
— string-width: checks whether a character occupies the width of two characters in the string (as a rule, the Asian characters of the CJK group). It is necessary for the correct calculation of the line length when splitting a paragraph into parts of the specified size (a small utility function hasCJK(str) defined at the end of the file makes the decision);
— wordwrap: splits a paragraph into lines of a given width with optional equal indentation from the left edge.
Then follows a series of internal variables responsible for improving the readability of the text, splitting paragraphs into substrings, adding basic and additional indentation, deleting DSL tags, cancelling escaping, and determining the borders of the cards.
Then, after registering the already familiar shutdown handler, the script looks for the dictionary annotation file in the folder and writes it into the generated text file (you could insert a check for the annotation's presence but, as a rule, it is always there).
Finally, the script reads the DSL file line by line, carefully observing the card changes (the transition from the end of a card to the first heading of the following one) and resetting the indentation width when necessary, and performs the required actions depending on the object of analysis (the heading or the body of the card):
— removes the escaping of special characters, the DSL tags and the "fuses" against the loss of blank lines;
— determines whether indentation needs to be added (if [mN] tags are encountered), checks for "wide" characters (in such cases I simply halved the allowable line width, but you can add finer tuning depending on the number of such characters), sets the current line width and the need for strict line-breaking within words (if words are encountered that exceed the allotted line), and formats the paragraph accordingly;
— writes the formatted paragraph into the text file.
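The width decision for lines containing CJK characters can be sketched as follows (hasCJK is named in the text; lineWidthFor and the exact character ranges are assumptions):

```javascript
// Rough check for CJK characters, which occupy two character cells in a
// monospaced text file.
function hasCJK(str) {
  return /[\u3000-\u9FFF\uF900-\uFAFF]/.test(str);
}

// When "wide" characters are present, halve the allowable line width —
// the simple approximation described in the text.
function lineWidthFor(str, normalWidth) {
  return hasCJK(str) ? Math.floor(normalWidth / 2) : normalWidth;
}
```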
Examples of the resulting text file can be seen in the screenshots of any of the tracker releases mentioned above.
7. Insert a text file page numbers
Sometimes the program from which you have to print to PDF has no option for automatic placement of page numbers. For such cases a pagination script for the text file may be useful.
paginate.js
The script takes two mandatory arguments: the path to the file and the number of lines per page (which it reduces by two, to leave room for the page number after a blank line).
For easy formatting of the numbers we load the string module, which provides many functions for working with string data. In particular, we need a function that pads a string with spaces to a given width so that the number sits in the middle, and a function that then removes the right-hand part of the padding. Thus we can place the numbers in the middle of the bottom row of each page.
In its procedural part the script reads the incoming file line by line, inserts the formatted page numbers, and writes it all to the output file. At the end the last number is inserted at the bottom of the final page.
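The centring described here does not strictly need an external module; a sketch in plain JavaScript (the function name is an assumption):

```javascript
// Centre a page number in a row of the given width: pad it on the left
// so it lands in the middle, with the right-hand padding already trimmed.
function centredNumber(n, width) {
  const s = String(n);
  const left = Math.floor((width - s.length) / 2);
  return ' '.repeat(Math.max(0, left)) + s;
}
```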
8. Splitting a file into parts
Here are two examples of splitting, depending on different parameters.
a. Splitting by letters of the alphabet
Sometimes a large dictionary is easier to work with if you split it into alphabetical parts. Adobe Acrobat, for example, is able to perceive all the parts as one unit if you index the whole parent folder and then search the index across all of its contents.
split_by_abc.js
Before starting, the script creates an additional report file in which to save information about the sizes of the parts (the number of lines in the file for each letter). This may be of both theoretical (statistical) and practical (processing time of the parts) interest; however, it is an optional complication.
Then an array with the alphabet is created (at its end is placed the symbol that will begin the title of the last part of the dictionary, the one containing the headwords that start with special characters), along with some auxiliary variables.
The procedural part begins with a call of the function newPart(chr), which will run every time a file needs to be created for storing the next part. This function:
— closes the previous file (if this is not the beginning of the work);
— records the number of lines in the report file;
— resets the line counter;
— creates a new file named after the next letter of the alphabet (making sure that the symbol for the last part of the dictionary is not one of those prohibited in file names);
— writes the BOM at the beginning of the new part (into the first part it is transferred automatically from the dictionary);
— prints the upcoming letter of the alphabet to the console and to the report file;
— if the alphabet array is not empty, forms the regular expression against which the lines of the input file will be checked in order to catch the transition to the next alphabetical part.
After the file for the first part is created, a cycle of reading the entire dictionary follows: each line is checked against the mentioned regular expression and, if necessary, a new file is created; the text of the next part is written to the current file, and the line count for the report file grows; at the end of the cycle the size of the last part is recorded and all files are closed.
This script can also be used to split huge DSL files that are too tough for the ABBYY compiler (for example, in recent versions of Lingvo, when trying to compile the voluminous Russian-English directions of the "Multitran" dictionaries, the compiler crashes with an "out of memory" message). In such a case it may be enough to split the dictionary into two parts at a middle letter of the alphabet (so the above-mentioned alphabet array can be reduced to a single letter).
the
b. Splitting by number of pages
When I tried to print an entire dictionary from a text editor to PDF, I ran into a limitation for larger files: the editor sent at most 65,000 pages to the printer. I had to divide the dictionary into chunks of 65,000 pages, print them separately and then combine the parts in Adobe Acrobat. So the following script was written.
split_by_pages.js
The script is similar to the previous one, with a few exceptions:
— the script takes three mandatory arguments at launch: the path to the dictionary, the number of lines per page, and the number of pages in one part of the split dictionary;
— no report file is created, because the sizes of the parts will be roughly the same;
— the function creating the new parts is simplified (there is no writing to a report file and no creation of a checking regular expression); instead of letters of the alphabet, the name of each part includes the number of its first page in the overall numbering;
— the read/write cycle creates a new file based not on the first letters of the headwords of a part, but on the count of lines and pages.
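The boundary arithmetic of this cycle can be sketched as follows (names are assumptions; lineIndex is zero-based):

```javascript
// Given the current line index, the lines per page and the pages per
// part, decide whether this line starts a new part and compute the
// number of the first page of that part for its file name.
function partBoundary(lineIndex, linesPerPage, pagesPerPart) {
  const linesPerPart = linesPerPage * pagesPerPart;
  const isBoundary = lineIndex % linesPerPart === 0;
  const firstPage = Math.floor(lineIndex / linesPerPart) * pagesPerPart + 1;
  return { isBoundary, firstPage };
}
```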
That's about all I can share with the creators and editors of electronic dictionaries who wish to try out the possibilities of Node.js. Thanks for your attention.
Article based on information from habrahabr.ru