Postgre(no)SQL or another data storage with a flexible structure

When the question comes about storing in the DB flexible (not known in advance, often changing) data structures, developers usually turn to the "great and terrible" the EAV-pattern, or to the now trendy NOSQL databases.
Not so long ago this task was in front of me.
EAV. Cause I have a strong dislike, and even spoken and written about this, it was very much negative (Kite, Fowler, Carwin, Gorman). The main disadvantage is that when you write the queries necessary to operate already not a real entity ("Employee", "House", "Client", which is SQL), and objects organizovannyi at a lower level (sorry for the confusion). Therefore, it was not the most desirable option.
NOSQL. Initially very interested in this option (particularly MongoDB). After prolonged use of relational first begin to experience the feeling of total freedom, which takes the breath away. Storage of documents of any structure, the immediate creation of new collections, queries to them — beauty! But after a short time, the euphoria began to subside, and the problems to show up:
— Poor query language (IMHO) + no joynow;
— Lack of (good article was recently on this topic (and only this) habrahabr.ru/post/164361);
— No built-in support for referential integrity;
The lack of bells and whistles in the form of stored procedures/functions, triggers, views, and much more.
— In my application in addition to data with flexible(variable) structure is also necessary to keep an ordinary static data table of users, visits, employees etc to Work with that (again IMHO) much easier and (most importantly) more reliable in a relational database (the same referential integrity, etc.).



The first problem is (partially) I tried to solve with the help of ORM (it was Spring Data), it is possible to write reasonable queries against the objects, but you need to pre-generate and compile all classes (appropriate collections of) and queries to operate already with them. For me it didn't fit, because the collection must be created and change often and quickly — on-the-go.
Second — by creating a separate collections storage structures all other collections, to check the correctness of input data, etc.
To solve the remaining problems is not reached, quit.
Already at this stage, my base began to resemble a very fragile structure, completely independent of the application, plus I had to manually implement many of the things that the majority of relational can do so, out of the box. Maybe it's OK, but then I'm not used to this, as it is not for myself.

Then I thought about how cool it would be to combine relational and NOSQL database. On the one hand, all the power of relationshi with all the accompanying with the other — the ease and elegance document solutions. And really, what prevents you to store objects with a flexible structure in a special separate table (or tables) for example in xml format, and access them using XPATH, especially because many modern DBMSs have developed means for working with XML (including indexing).
I decided to try on a small example with Postgresql, what happens, will look like the queries:

Enough to start two service tables, I think no comment:

the
CREATE TABLE classes
(
id integer NOT NULL,
name text,
is_closed boolean,
obects_count integer,
Classes_pk CONSTRAINT PRIMARY KEY (id )
);

CREATE TABLE objects
(
id integer NOT NULL,
body xml
id_classes integer,
Objects_pk CONSTRAINT PRIMARY KEY (id ),
Classes_objects CONSTRAINT FOREIGN KEY (id_classes)
REFERENCES classes (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE INDEX fki_classes_objects
ON objects
USING btree
(id_classes );



Create two entities for experimetnal:

the
INSERT INTO classes(
id, name, is_closed, obects_count)
VALUES (1, 'customers', FALSE, 0);

INSERT INTO classes(
id, name, is_closed, obects_count)
VALUES (2, 'orders', FALSE, 0);


Prepare two functions for generating random test data (taken on the Internet):

the
CREATE OR REPLACE FUNCTION random(numeric, numeric)
RETURNS numeric AS
$BODY$
SELECT ($1 + ($2 - $1) * random())::numeric;
$BODY$
LANGUAGE sql VOLATILE
COST 100;

CREATE OR REPLACE FUNCTION random_string(length integer)
RETURNS text AS
$BODY$
declare
chars text [] : = '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result text := ";
i integer := 0;
begin
if length < 0 then
raise exception 'Given length cannot be less than 0';
end if;
for i in 1..length loop
result := result || chars[1+random()*(array_length(chars, 1)-1)];
end loop;
return result;
end;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;


Populate the table with random data, objects of the classes "Customer" and "Order" (one-to-many relationship, each customer made five orders):

the
DO $$
DECLARE
customer_pk integer;
order_pk integer;
BEGIN
FOR i in 1..10000 LOOP
customer_pk := nextval('objects_id_seq');
order_pk := nextval('objects_id_seq');

insert into objects (body, id_classes) values((
'<Customers>
<Customer>
<ID>' || customer_pk || '</ID>
<Name>' || random_string('10') || '</Name>
<Partners>' || random_string('10') || '</Partners>
</Customer>
</Customers>')::xml, 1);


for j in 1..5 LOOP

insert into objects (body, id_classes) values((
'<Orders>
<Order>
<ID>' || order_pk || '</ID>
<Customer_id>' || customer_pk || '</Customer_id>
<Cost>' || random(1, 1000) || '</Cost>
</Order>
</Orders>')::xml, 2);

end loop;

END LOOP;
END$$;


The first query will select the maximum cost of the order:

the
explain select max(((xpath('/Orders/Order/Cost/text()', O. body))[1])::text::float) as cost_of_order
from Objects O
where O. id_classes = 2;

the
/*
Aggregate (cost=2609.10 2609.11..rows=1 width=32)
-> Seq Scan on objects o (cost=0.00..2104.50 rows=50460 width=32)
Filter: (id_classes = 2)
*/


The request was a bit tricky, but still quite clear: once it is clear which entity is running a query and on what attribute. Oddly enough it turned out a full scan, but nothing prevents to build the index on attribute Cost:

the
create index obj_orders_cost_idx on objects using btree (((xpath('/Orders/Order/Cost/text()', body))[1]::text::float));


And the query now works much faster and uses the index:
the
/*
Result (cost=0.15..0.16 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.15 rows=1 width=32)
-> Index Scan Backward using obj_orders_cost_idx on objects o (cost=0.00..7246.26 rows=50207 width=32)
Index Cond: (((( xpath('/Orders/Order/Cost/text()'::text, body, '{}'::text[]))[1])::text)::double precision IS NOT NULL)
Filter: (id_classes = 2)
*/


Now try to select information about orders several specific employees, i.e. a relationship between two tables:

the
explain select (xpath('/Customers/Customer/Name/text()', C. body))[1] as customer
, (xpath('/Orders/Order/Cost/text()', O. body))[1] as cost_of_order
from objects C
, objects O
where C. id_classes = 1
and O. id_classes = 2
and (xpath('/Orders/Order/Customer_id/text()', O. body))[1]::text::int = (xpath('/Customers/Customer/ID/text()', C. body))[1]::text::int
and ((xpath('/Customers/Customer/ID/text()' ,C. body))[1])::text::int between 1997585 and 1997595;


the
/*
Hash Join (cost=1873.57 6504.85..rows=12867 width=64)
Hash Cond: (((( xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
-> Seq Scan on objects o (cost=0.00..2104.50 rows=50460 width=32)
Filter: (id_classes = 2)
-> Hash (cost=1872.93 1872.93..rows=51 width=32)
-> Bitmap Heap Scan on objects c (cost=1872.93..196.38 rows=51 width=32)
Recheck Cond: (id_classes = 1)
Filter: ((((( xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595))
-> Bitmap Index Scan on fki_classes_objects (cost=0.00..196.37 rows=10140 width=0)
Index Cond: (id_classes = 1)
*/


Expected FULLSCAN, now slightly index:

the
create index obj_customers_id_idx on objects using btree (((xpath('/Customers/Customer/ID/text()', body))[1]::text::int));
create index obj_orders_id_idx on objects using btree (((xpath('/Orders/Order/ID/text()', body))[1]::text::int));
create index obj_orders_customerid_idx on objects using btree (((xpath('/Orders/Order/Customer_id/text()', body))[1]::text::int));


Now it turns out fun:

the
/*
Hash Join (cost=380.52 5011.80..rows=12867 width=64)
Hash Cond: (((( xpath('/Orders/Order/Customer_id/text()'::text, o.body, '{}'::text[]))[1])::text)::integer = (((xpath('/Customers/Customer/ID/text()'::text, c.body, '{}'::text[]))[1])::text)::integer)
-> Seq Scan on objects o (cost=0.00..2104.50 rows=50460 width=32)
Filter: (id_classes = 2)
-> Hash (cost=..379.88 379.88 rows=51 width=32)
-> Bitmap Heap Scan on objects c (cost=204.00..379.88 rows=51 width=32)
Recheck Cond: ((((( xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer >= 1997585) AND ((((xpath('/Customers/Customer/ID/text()'::text, body, '{}'::text[]))[1])::text)::integer <= 1997595) AND (id_classes = 1))
-> BitmapAnd (cost=204.00 204.00..rows=51 width=0)
-> Bitmap Index Scan on obj_customers_id_idx (cost=0.00..7.35 rows=303 width=0)

-> Bitmap Index Scan on fki_classes_objects (cost=0.00..196.37 rows=10140 width=0)
Index Cond: (id_classes = 1)
*/


This request also has not lost its clarity, however it can be more brush to deal with the type conversion, to optimize the structure of xml etc Work a lot more, it's just a small example.

What else can be done:

1. Flexible search attributes objects of any class;
2. Table objects can be particioate (at least partially), for example, to store objects of large classes is physically separate.

Storing data in the database in xml format estestveeno is not new, however, if you find a solution to the my question of the information about it was very small, not to mention specific examples. I hope someone be useful, or(and) to hear in the comments, the reviews and opinions of people who had worked with a similar scheme.
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Monitoring PostgreSQL with Zabbix

PostgreSQL load testing using JMeter, Yandex.Tank and Overload

MODX Revolution meets Fenom