Posts

Posts for November 2017

The BM25 algorithm

I first came across this algorithm on Wikipedia and did not pay it much attention. Later, while studying papers by Yandex employees, I noticed that they refer to it (for example, Segalovich's article on near-duplicate detection algorithms), so I decided to figure out why it is useful. I will try to explain it with simple examples. So, what does this algorithm do? First, it makes relevance depend on the occurrence or non-occurrence of individual words in queries with more than one word. Suppose there are several multi-word queries, for example (purely illustrative): "Samsung" and "buy a Samsung Galaxy smartphone". Now compare two documents (again illustrative), where the first document does not contain the word "Galaxy". The score of a document is the sum of the relevance values of each query word, and the relevance of each word is equal to its IDF multiplied by the second factor in the expression above…
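The scoring the excerpt alludes to can be sketched in a few lines. This is a minimal, illustrative BM25 implementation, not code from the article; the parameters k1 = 1.2 and b = 0.75 are conventional defaults, and the smoothed IDF form is one common variant.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query: sum over query terms of IDF * TF factor."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # documents containing the term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
        f = tf[term]                                   # term frequency in this document
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    ["buy", "samsung", "smartphone"],           # misses "galaxy"
    ["buy", "samsung", "galaxy", "smartphone"], # contains every query word
]
query = ["buy", "samsung", "galaxy", "smartphone"]
scores = [bm25_score(query, d, corpus) for d in corpus]
```

As in the example from the text, the document that lacks one of the query words ("galaxy") receives a strictly lower score, because that word contributes zero to its sum.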

Natural language processing: the missing tool

Suppose you want to create a web application. In the modern world, countless pieces of software have been created to make your life easier. You can use a comprehensive framework, or plug in a couple of libraries that solve common tasks such as standardization, database management, and interactivity. These libraries provide a consistent interface for solving both common tasks and edge cases that you might not be able to handle right away. But amid this abundance of tools there is a significant gap: libraries for working with natural language. "But there are plenty of those!" you might object. "NLTK or LingPipe." Of course, but do you actually use them? "Well, my project doesn't seem to require natural language processing." In fact, you are processing natural language without even realizing it. Something as basic as string concatenation is just a special case of text generation in natural language, one of the fundamental parts of NLP. But if you ever need…
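The claim that string concatenation is already a degenerate form of natural language generation is easy to illustrate: naive concatenation breaks down as soon as grammar enters the picture. A toy sketch (mine, not from the article):

```python
def naive_message(n):
    # Plain string concatenation: reads correctly only when n != 1.
    return "You have " + str(n) + " new messages"

def nlg_message(n):
    # Even trivial NLG has to handle grammatical agreement (pluralization).
    noun = "message" if n == 1 else "messages"
    return f"You have {n} new {noun}"
```

The naive version produces "You have 1 new messages"; handling agreement, gender, or case in other languages quickly turns this into a real NLG problem.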

How not to lose data in PostgreSQL

PostgreSQL offers several options for data backup. All of them have been described more than once, including on Habr, but the articles mostly cover the technical details of each method. I want to talk about an overall backup strategy that combines all the methods into an effective system, one that will help you keep all your data and spare your nerve cells in critical situations. Inputs: a PostgreSQL 9.2 server with a database larger than 100 GB. Backup options. As the manuals show, there are three backup methods: streaming replication, copying the database files, and an SQL dump. Each has its own characteristics, so we use all of them. Streaming replication. Setting up streaming replication is well described in articles here and here. The point of this replication is that if the primary server goes down, the slave can quickly be promoted to master and take over, because the slave is a complete copy of the database. In streaming…
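As an illustration of the SQL-dump leg of such a strategy, here is a hypothetical sketch (not from the article) that builds a dated pg_dump invocation; the `-Fc` (custom, compressed format restorable with pg_restore) and `-f` (output file) flags are standard pg_dump options. The database name and backup directory are made up.

```python
import datetime

def dump_command(dbname, backup_dir):
    """Build a pg_dump command producing a dated, compressed custom-format dump."""
    stamp = datetime.date.today().isoformat()
    outfile = f"{backup_dir}/{dbname}-{stamp}.dump"
    # -Fc: custom format (compressed, selectively restorable with pg_restore)
    return ["pg_dump", "-Fc", "-f", outfile, dbname]

cmd = dump_command("mydb", "/var/backups/pg")
# subprocess.run(cmd, check=True)  # would run the actual dump on a real server
```

In practice such a command would be run from cron nightly, alongside the streaming replica and periodic file-level base backups.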

Testing stored functions using pgTAP

I recently posted an article with a "skeleton" schema that can be used as a starting point for your own PostgreSQL schemas. Besides the deployment scripts and object definitions, it included examples of stored functions and unit tests for them. In this article I want to use the pg_skeleton example to elaborate on how to write tests for stored functions in PostgreSQL using pgTAP. As the name implies, pgTAP tests emit plain text in the TAP (Test Anything Protocol) format. This format is accepted by many CI systems; we use the Jenkins TAP Plugin. Installing the extension creates a set of stored functions in the database (by default in the public schema) that we will use when writing tests. Most of these functions are various assertions; the full list can be found here: http://pgtap.org/documentation.html. We will test the test_user function from the example schema. First, install pg_skeleton. (If you want to write tests in your own schema, from the inst…
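To make the TAP format concrete: pgTAP prints a plan line plus one "ok"/"not ok" line per assertion, which is exactly what a CI plugin consumes. A minimal parser sketch (illustrative only, not part of pgTAP or the Jenkins plugin; the sample output below is invented):

```python
def parse_tap(text):
    """Count passing and failing assertions in a TAP stream."""
    passed = failed = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("not ok"):
            failed += 1
        elif line.startswith("ok"):
            passed += 1
    return passed, failed

sample = """1..3
ok 1 - function test_user() exists
ok 2 - test_user() returns a result set
not ok 3 - test_user() owner check
"""
result = parse_tap(sample)
```

Note that "not ok" must be checked before "ok", since every failing line also starts with the substring a naive "ok" check would match.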

Make it easy on yourself

In my projects I use PostgreSQL because it is open, free, and quite feature-rich. The adopted architecture is to create a view for each table, and the applications work with the views. In many cases a view is a one-to-one copy of its table, but for each of them you need to create and write the rules for updating, deleting, and inserting records, which takes time. One day I got tired of this and decided to automate the process. The result is the following function:

CREATE OR REPLACE FUNCTION pg_createview(table_ text, schema_ text)
RETURNS integer AS $BODY$
DECLARE
  obj record;
  num integer;
  _schema alias for $2;
  _tablelike alias for $1;
  _table character varying;
  sql character varying;
  sqlclm1 character varying;
  sqlclm2 character varying;
  sqlclmkey character varying;
  _col text;
  exist_view character varying;
BEGIN
  num := 0;
  FOR obj IN
    SELECT relname
    FROM pg_class c
    JOIN pg_namespace ns ON (c.relnamespace = ns.oid)
    WHERE relkind = 'r' AND nspname = $2 AN…

Monitoring PostgreSQL with Zabbix

PostgreSQL is a modern, actively developed DBMS with a very large feature set that lets it handle a wide range of tasks. PostgreSQL usage typically belongs to a highly critical segment of the IT infrastructure: data processing and storage. Given this special place in the infrastructure and the criticality of its tasks, monitoring and proper oversight of the DBMS become essential. In this respect, PostgreSQL has extensive built-in facilities for collecting and storing statistics. The collected statistics give a fairly detailed picture of what is happening under the hood while the DBMS is running. These statistics are stored in special system tables and views and are constantly updated. By running regular SQL queries against these tables, you can obtain a variety of information about databases, tables, indexes, and other DBMS subsystems. Below I describe a method and tools for monitoring PostgreSQL with the Zabbix monitoring system. I love this monitoring system because…
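One common way to feed such statistics into Zabbix is a UserParameter that shells out to psql and returns a bare number. Whether the article uses exactly this mechanism is an assumption; the sketch below only builds the command. psql's `-A` (unaligned), `-t` (tuples only), and `-c` (run a command) flags are standard, and `pg_stat_database.xact_commit` is a real statistics column; the user name is made up.

```python
def zabbix_pg_item(metric_sql):
    """Build a psql invocation suitable for a Zabbix UserParameter:
    -A (unaligned) and -t (tuples only) yield a bare value Zabbix can ingest."""
    return ["psql", "-At", "-c", metric_sql, "-U", "postgres"]

# Example metric: transactions committed in the current database
commits_sql = ("SELECT xact_commit FROM pg_stat_database "
               "WHERE datname = current_database()")
cmd = zabbix_pg_item(commits_sql)
```

Zabbix would then graph the returned counter and alert on its rate of change.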

The PostGIS geometry data type, using maps imported from OpenStreetMap as an example

For my project I needed to build pedestrian routes and compute their lengths. I solved this problem with pgRouting, which in turn relies on PostGIS. PostGIS is a PostgreSQL extension that implements the OpenGIS standard and contains extensive functionality for working with spatial data, which lets you write interesting applications. In particular, OpenStreetMap uses PostGIS to render its maps. I will try to show what OSM maps imported into PostGIS look like. I will skip the description of how to install Postgres and PostGIS, and start by creating a database to store the spatial data. Create the database: create database openstreetmap; Initialize PostGIS: create extension postgis; A table named spatial_ref_sys will appear in the database, containing nearly 4 thousand records. Each record corresponds to a spatial reference system that defines a projection of longitude and latitude onto a pla…
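Since the stated goal was route lengths, note that lon/lat coordinates alone do not give meters; PostGIS handles this via its geography type, but the underlying great-circle computation can be sketched in a few lines. This is an illustrative haversine formula, not PostGIS code:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, r=6371000.0):
    """Great-circle distance in meters between two lon/lat points (mean Earth radius)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km
d = haversine_m(30.0, 59.0, 30.0, 60.0)
```

A route length is then just the sum of such distances over consecutive vertices of the route geometry.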

Fasten a spatial index to the unsuspecting OpenSource DBMS

I have always liked it when the title clearly says what will happen next, as in "The Texas Chain Saw Massacre". So, under the cut we add spatial search to a DBMS that originally did not have it. Introduction. Let's dispense with general words about the importance and usefulness of spatial search. To the question of why not take an open DBMS with built-in spatial search, the answer is: "If you want something done right, do it yourself" (c). But seriously, we will try to show that most practical spatial problems can be solved on quite an ordinary desktop, without stress and extra costs. As our experimental DBMS we will use the open edition of OpenLink Virtuoso. Its paid version has spatial indexing available, but that's another story. Why this DBMS? The author likes it: fast, simple, powerful; copy it, launch it, everything works. We will use only stock tools, no plug-ins or special builds, only the official build and a text editor. In…
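The excerpt cuts off before naming the technique, but a common way to bolt spatial search onto a DBMS that only has ordinary B-tree indexes is to interleave the bits of the x and y coordinates into a single Z-order (Morton) key and index that; whether this particular article uses exactly this approach is an assumption on my part. A sketch of the key construction:

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of two unsigned ints into one Z-order (Morton) key.
    Points that are close in 2D tend to be close along the key, so a plain
    B-tree range scan over the key approximates a spatial window query."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key
```

A window query then becomes a set of key ranges plus an exact coordinate re-check of the candidates, since the Z-curve occasionally jumps across the plane.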

MODX Revolution meets Fenom

Recently the English-speaking MODX community has been discussing "how we live" a lot. Everyone has ground to a halt discussing the upcoming major version 3 (a few years away, I believe), while we keep improving add-ons for the current one. The news I would like to share with a wider audience is the new release of pdoTools with the Fenom template engine, which lets you get rid of the clutter of tags inside chunks and rewrite them in the template engine's plain language. The switch does not require any changes to the site: simply update pdoTools to version 2.0 and you can use the new syntax. The best part is that MODX tags complement Fenom perfectly and the two work together without any problems. A simple example for starters:

{if $parent == 3}
    [[!pdoMenu?parents=`0`]]
{else}
    [[!pdoResources?parents=`1,2,3`]]
{/if}

Under the cut is a huge amount of information about the pdoTools parser that I have never before collected in one place. So, the pdoTools parser…