Bolting a spatial index onto an unsuspecting open-source DBMS

I have always liked it when a title says clearly what is going to happen, as in "The Texas Chain Saw Massacre". So, under the cut, we add spatial search to a DBMS that never had it.

Introduction
Let's dispense with general words about the importance and usefulness of spatial search.
To the question "why not take an open DBMS with built-in spatial search?" the answer is: "if you want something done right, do it yourself" (c). But seriously, we will try to show that most practical spatial problems can be solved on quite ordinary desktop hardware, without stress and extra costs.

As the experimental DBMS we will use the open edition of OpenLink Virtuoso. Its paid version does have spatial indexing, but that's another story. Why this database? The author likes it: fast, easy, powerful. Copy, launch, everything works. We will use only stock tools, with no plug-ins or special builds: only the official build and a text editor.

Index type
We will index the points using a pixel-code (block) index.
What is a pixel-code index? It is a compromise based on a space-filling curve that makes it possible to work with non-point objects in a 2-dimensional search space. All methods of this class share the same scheme:

  • the search space is divided into blocks in some way
  • all blocks are numbered
  • each indexed object is assigned the number(s) of the block(s) in which it is located
  • at query time the search extent is split into blocks, and for each block (or contiguous chain of blocks) an ordinary integer index (a B-tree) yields all the objects that belong to them

The best-known methods are:

  • Good old Igloo (or another space-filling curve) with limited precision. Space is cut into cells whose step is (usually) the typical size of the indexed object. The cells are numbered as X...xY...yZ...z..., i.e. by concatenating the cell coordinates along the axes.
  • HEALPix, short for Hierarchical Equal Area iso-Latitude Pixelisation. This method partitions the spherical surface so as to avoid the main drawback of Igloo: different cell areas at different latitudes. The sphere is divided into 12 equal base pixels, each of which is then hierarchically divided into 4 sub-pixels. The illustration in the article header shows this scheme.
  • Hierarchical Triangular Mesh (HTM). A scheme of successive approximation of a sphere with triangles: starting from 8, each triangle is hierarchically divided into 4 sub-triangles.
  • GRID, one of the most famous block-index schemes, which treats spatial data as a special case of a multidimensional index. In this scheme:
    1. space is cut into pieces by a grid
    2. the resulting cells are numbered in some way
    3. data falling into a particular cell are stored together
    4. attribute values are translated into cell numbers along the axes via a "directory", which is assumed to be small enough to fit in memory
    5. if the "directory" is large, it is split into levels, tree-style
    6. as the data change, the grid can adapt: cells or columns are split or, conversely, merged
    A 3-level GRID based on Z-order is used by DB2; a 4-level GRID based on a variant of the Hilbert curve is used in MS SQL Server 2008.
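To make the cell-numbering idea concrete, here is a minimal sketch (in Python, not part of the original article) of the Z-order (Morton) numbering that the DB2 GRID variant relies on: the cell number is obtained by interleaving the bits of the cell coordinates, so nearby cells tend to get nearby numbers.

```python
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of cell coordinates x and y into one Z-order number."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z

# Neighbouring cells along the curve get close numbers:
# morton2d(0, 0) == 0, morton2d(1, 0) == 1, morton2d(0, 1) == 2, morton2d(1, 1) == 3
```

A contiguous run of Z-order numbers corresponds to a compact region of cells, which is exactly what makes a plain B-tree range scan usable as a spatial filter.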

For all their disadvantages (the main one, perhaps, being the need to tune the index for a specific task), block indexes have their pluses: simplicity and omnivorousness. In essence such an index is an ordinary B-tree with an extra semantic load: we somehow convert geometry into numbers for the index, and in the same way we obtain the keys to search it.

Data type
Why points? As already mentioned, a block index does not care what it indexes. The rasterizer for points is trivial, while one for polygons would have to be written, and that is entirely beside the topic.

The data
Without further ado:

  • we put points at the nodes of a 4000×4000 lattice with the origin at (0,0) and step 1
  • this gives a square of 16 million points
  • the block-index cell is taken as 10×10
  • so we have 400×400 cells with 100 points each

The index does not care how the points are arranged inside a cell, while this kind of regularity lets us check the correctness of the results.
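The layout arithmetic above can be sanity-checked in a few lines (a Python illustration, not part of the original article):

```python
sz, step = 4000, 10           # lattice side and block-index cell side

n_points = sz * sz            # total points on the lattice
cells_per_side = sz // step   # cells along each axis
n_cells = cells_per_side ** 2
points_per_cell = n_points // n_cells

print(n_points, cells_per_side, points_per_cell)  # 16000000 400 100
```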

To generate the data we use the following AWK script (call it 'data.awk'):
BEGIN {
  cnt = 1; sz = 4000;
  # Set the checkpoint interval to 100 hours -- enough to complete most of the experiments.
  print "checkpoint_interval (6000);";
  print "SET AUTOCOMMIT MANUAL;";
  for (i = 0; i < sz; i++) {
    for (j = 0; j < sz; j++) {
      print "insert into \"xxx\".\"YYY\".\"_POINTS\"";
      print "  (\"oid_\",\"x_\", \"y_\") values ("cnt","i","j");";
      cnt++;
    }
    print "commit work;";
  }
  print "checkpoint;";
  # Restore a 1-hour checkpoint interval.
  print "checkpoint_interval (60);";
  exit;
}


Server

  • We use the official build from the company's website. The author used the latest version at the time, 7.0.0 (x86_64), but it does not matter much; you can take any other.
  • Install, if it is not already present, the Microsoft Visual C++ 2010 Redistributable Package (x64).
  • Choose a directory for the tests and copy there libeay32.dll, ssleay32.dll, isql.exe, virtuoso-t.exe from the 'bin' archive and virtuoso.ini from the 'database' archive.
  • Start the server with 'start virtuoso-t.exe +foreground' and allow the firewall to open the ports.
  • Check the availability of the server with 'isql localhost:1111 dba dba'. Once the connection is established, enter, for example, 'select 1;' and make sure the server is alive.


Schema
To begin, we create a table of points, nothing extra:
create user "YYY";
create table "xxx"."YYY"."_POINTS" (
"oid_" integer,
"x_" double precision,
"y_" double precision,
primary key ("oid_") 
);

Now the index itself:
create table "xxx"."YYY"."_POINTS_spx" (
"node_" integer, 
"oid_" integer references "xxx"."YYY"."_POINTS", 
primary key ("node_", "oid_") 
);
Here node_ is the ID of a cell of the search space. Note that although the index is created as a table, both fields form the primary key, so the whole thing is packed into a tree.

Index settings:
registry_set ('__spx_startx', '0');
registry_set ('__spx_starty', '0');
registry_set ('__spx_nx', '4000');
registry_set ('__spx_ny', '4000');
registry_set ('__spx_stepx', '10');
registry_set ('__spx_stepy', '10');

The system function registry_set records name/value string pairs in the registry; this is much faster than keeping them in an auxiliary table, yet still covered by transactions.

Trigger on insert:
create trigger "xxx_YYY__POINTS_insert_trigger" after insert
on "xxx"."YYY"."_POINTS"
referencing new as n
{
  declare nx integer;
  nx := atoi (registry_get ('__spx_nx'));
  declare startx, starty double precision;
  startx := atof (registry_get ('__spx_startx'));
  starty := atof (registry_get ('__spx_starty'));
  declare stepx, stepy double precision;
  stepx := atof (registry_get ('__spx_stepx'));
  stepy := atof (registry_get ('__spx_stepy'));
  -- cell coordinates of the new point
  declare sx, sy integer;
  sx := floor ((n.x_ - startx)/stepx);
  sy := floor ((n.y_ - starty)/stepy);
  -- row-major cell number
  declare ixf integer;
  ixf := (nx / stepx) * sy + sx;
  insert into "xxx"."YYY"."_POINTS_spx" ("oid_","node_")
    values (n.oid_, ixf);
};
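The cell-number arithmetic in the trigger is easy to get wrong, so here is the same computation mirrored in Python (an illustration only, using the parameter values set above: origin (0,0), extent 4000×4000, cell 10×10):

```python
def cell_id(x: float, y: float,
            startx: float = 0.0, starty: float = 0.0,
            nx: int = 4000, stepx: float = 10.0, stepy: float = 10.0) -> int:
    """Row-major number of the cell containing point (x, y), as in the trigger."""
    sx = int((x - startx) // stepx)       # cell column
    sy = int((y - starty) // stepy)       # cell row
    return int(nx // stepx) * sy + sx     # 400 cells per row with these settings

# Point (15, 25) lands in column 1, row 2, i.e. cell 400*2 + 1 = 801.
```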

Put all of this into, for example, 'sch.sql' and execute it with
isql.exe localhost:1111 dba dba <sch.sql

Loading the data
Once the schema is ready, we can load the data:
gawk -f data.awk | isql.exe localhost:1111 dba dba >log 2>errlog
This took the author 45 minutes (~6000 inserts per second) on an Intel i7-3612QM 2.1 GHz with 8 GB of memory.
The server occupied 1 GB of memory; the database size on disk was 540 MB, i.e. ~34 bytes per point together with its index.
Now verify that the result is correct; enter in isql:
select count(*) from "xxx"."YYY"."_POINTS";
select count(*) from "xxx"."YYY"."_POINTS_spx";
Both queries should produce 16 000 000.

As a reference point one can use previously published load timings:
[figure: load timings, PostgreSQL (Linux), 2.2 GHz Xeon]
On the one hand, those numbers were obtained 9 years ago. On the other hand, the bottleneck here is disk operations, and random disk access has not become that much faster in the intervening period.
Note that bulk-loading the data first and indexing afterwards is, for obvious reasons, significantly faster than indexing on the fly. But indexing on the fly is what interests us here, because for practical purposes a spatial DBMS is more interesting than a spatial search system.

One more note: when loading spatial data, the order in which they arrive matters a great deal.
In this experiment, as in the one cited above, the order of point insertion was perfect (or close to it) from the point of view of the spatial index. If the data are shuffled, performance drops quite dramatically regardless of the chosen DBMS.
If the data arrive in random order, one can recommend accumulating them in a staging buffer and, on overflow, sorting and loading them in portions.

Setting up the search
We will use a great Virtuoso feature: the procedural view.
create procedure "xxx"."YYY"."_POINTS_spx_enum_items_in_box" (
  in minx double precision,
  in miny double precision,
  in maxx double precision,
  in maxy double precision)
{
  declare nx, ny integer;
  nx := atoi (registry_get ('__spx_nx'));
  ny := atoi (registry_get ('__spx_ny'));

  declare startx, starty double precision;
  startx := atof (registry_get ('__spx_startx'));
  starty := atof (registry_get ('__spx_starty'));

  declare stepx, stepy double precision;
  stepx := atof (registry_get ('__spx_stepx'));
  stepy := atof (registry_get ('__spx_stepy'));

  -- cell coordinates of the search box: columns [sx, fx), rows [sy, fy)
  declare sx, sy, fx, fy integer;
  sx := floor ((minx - startx)/stepx);
  fx := floor (1 + (maxx - startx - 1)/stepx);
  sy := floor ((miny - starty)/stepy);
  fy := floor (1 + (maxy - starty - 1)/stepy);

  declare mulx integer;
  mulx := nx / stepx;  -- cells per row

  -- collect IDs in a hash map to eliminate duplicates
  declare res any;
  res := dict_new ();
  for (declare iy integer, iy := sy; iy < fy; iy := iy + 1)
    {
      declare ixf, ixt integer;
      ixf := mulx * iy + sx;
      ixt := mulx * iy + fx;
      for select "node_", "oid_" from "xxx"."YYY"."_POINTS_spx"
        where "node_" >= ixf and "node_" < ixt do
        {
          dict_put (res, "oid_", 0);
        }
    }
  declare arr any;
  arr := dict_list_keys (res, 1);
  gvector_digit_sort (arr, 1, 0, 1);

  declare oid integer;
  result_names (oid);
  foreach (integer id in arr) do
    {
      result (id);
    }
};
create procedure view "xxx"."YYY"."v_POINTS_spx_enum_items_in_box" as
"xxx"."YYY"."_POINTS_spx_enum_items_in_box" (minx, miny, maxx, maxy) ("oid" integer); 
In essence the procedure is simple: it determines which block-index cells the search extent covers, then for each row of those cells runs a sub-query that records the IDs of the found entries in a hash map (eliminating duplicates along the way), and finally emits the resulting set of IDs.
How do we use the spatial index? For example:
select count(*), avg(x.x_), avg(x.y_) from
"xxx"."YYY"."_POINTS" as x, "xxx"."YYY"."v_POINTS_spx_enum_items_in_box" as a
where a.minx = 91. and a.miny = 228. and a.maxx = 139. and a.maxy = 295.
and x."oid_" = a.oid
and x.x_ >= 91. and x.x_ < 139. and x.y_ >= 228. and x.y_ < 295.;
This is a join, by ID, between the points table and the synthetic record set of the procedural view. The procedural view plays the role of a coarse filter that works at cell granularity; therefore the last line adds the fine filter.
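The coarse-filter logic of the procedure, i.e. which node_ ranges get scanned for a given box, can be sketched in Python like this (an illustration under the same assumed parameters: origin (0,0), 4000×4000 extent, 10×10 cells):

```python
def node_ranges(minx, miny, maxx, maxy,
                startx=0.0, starty=0.0, nx=4000, stepx=10.0, stepy=10.0):
    """Half-open [ixf, ixt) ranges of cell numbers covering the search box,
    one range per row of cells, as in the procedural view."""
    sx = int((minx - startx) // stepx)
    fx = int(1 + (maxx - startx - 1) // stepx)
    sy = int((miny - starty) // stepy)
    fy = int(1 + (maxy - starty - 1) // stepy)
    mulx = int(nx // stepx)   # cells per row
    return [(mulx * iy + sx, mulx * iy + fx) for iy in range(sy, fy)]

# For the box (91, 228)-(139, 295) the coarse filter scans cell rows 22..29
# and columns 9..13: 8 ranges of 5 cells each, covering x in [90, 140), y in [220, 300).
```

Since the cells slightly overshoot the box, the coarse filter returns a superset of the answer, which is why the query above re-checks x_ and y_.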
Expected result:

  • Count: (139 - 91) * (295 - 228) = 3216
  • Average X: (139 + 91 - 1) / 2 = 114.5 (-1 because the right border is excluded)
  • Average Y: (295 + 228 - 1) / 2 = 261 (same reason)

So, with one query in one transaction, we have performed a spatial search.
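These expected numbers are easy to verify by brute force over the integer lattice (a Python check, illustration only):

```python
# Points sit on integer lattice nodes; the query box is [91, 139) x [228, 295).
xs = range(91, 139)
ys = range(228, 295)

count = len(xs) * len(ys)
avg_x = sum(xs) / len(xs)   # y does not affect the x-average on a full grid
avg_y = sum(ys) / len(ys)

print(count, avg_x, avg_y)  # 3216 114.5 261.0
```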

Benchmark
To generate the query stream, we use the following AWK script:
BEGIN {
  srand();
  for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
      startx = int(rand() * 3900);
      finx = int(startx + rand() * 100);
      starty = int(rand() * 3900);
      finy = int(starty + rand() * 100);
      print "select count(*), avg(x.x_), avg(x.y_) from";
      print "\"xxx\".\"YYY\".\"_POINTS\" as x, ";
      print "\"xxx\".\"YYY\".\"v_POINTS_spx_enum_items_in_box\" as a";
      print "where a.minx = "startx". and a.miny = "starty".";
      print " and a.maxx = "finx". and a.maxy = "finy". and";
      print "x.\"oid_\" = a.oid and x.x_ >= "startx". and x.x_ < "finx".";
      print " and x.y_ >= "starty". and x.y_ < "finy".;";
    }
  }
  exit;
}
As you can see, this produces 10,000 queries over random boxes of random size (0...100). We prepare 4 such sets and perform three series of tests: one isql instance, 2 in parallel, and 4 in parallel. On a processor with 8 hardware threads, further parallelization makes no sense. Let's measure the throughput.

  • 1 stream: 151 sec = ~66 queries per second per stream
  • 2 streams: 159 sec = ~63 queries per second per stream
  • 4 streams: 182 sec = ~55 queries per second per stream

You can try to compare these times with those obtained head-on, using two ordinary indexes on x_ and y_. We index the table by those fields and run a similar series of tests with queries like

select count(*), avg(x.x_), avg(x.y_) from "xxx"."YYY"."_POINTS" as x
where x.x_ >= 91. and x.x_ < 139. and x.y_ >= 228. and x.y_ < 295.;

The result: the series of 10,000 queries runs in 417 seconds, i.e. about 24 queries per second.

Verdict
The presented method has several disadvantages, the chief of which, as already noted, is the need to tune it for a specific task: its performance depends heavily on settings that cannot be changed after the index is created. However, this is true of most spatial indexes in general.

But even this index, written on the knee, gives tolerable performance. Furthermore, since we control the whole process, more complex structures can be built on this basis, for example multi-table indexes, which are hard to build by regular means. But that is a topic for another time.
Article based on information from habrahabr.ru
