Monitoring PostgreSQL with Zabbix


PostgreSQL is a modern, actively developed DBMS with a very large set of capabilities that allow it to solve a wide range of tasks. PostgreSQL usually serves a highly critical segment of the IT infrastructure, the one responsible for data processing and storage. Given this special place in the infrastructure and the criticality of its tasks, the question of monitoring and proper control over the DBMS naturally arises. PostgreSQL has extensive internal means of collecting and storing statistics, which give a fairly detailed picture of what is happening under the hood while the DBMS is running. These statistics are stored in special system tables and views and are constantly updated. By running regular SQL queries against these tables, you can obtain a variety of information about databases, tables, indexes, and other DBMS subsystems.
Below I describe a method and tools for monitoring PostgreSQL with the Zabbix monitoring system. I like this monitoring system because it provides broad possibilities for implementing custom monitoring of various systems and processes.

The monitoring is based on SQL queries to the statistics tables. The queries are shipped as an additional configuration file for the Zabbix agent, in which the SQL queries are wrapped into so-called UserParameters — custom monitoring parameters. A user parameter in Zabbix is a great way to set up monitoring for unusual things, which in our case are PostgreSQL metrics. Each user parameter consists of two elements: a key name and a command. The key name is a unique name that does not overlap with any other key names. The command is the actual action that the Zabbix agent must run. In the extended variant, parameters can be passed to the command. In the Zabbix configuration it looks like this:
UserParameter=custom.simple.key,/usr/local/bin/simple-script
UserParameter=custom.ext.key[*],/usr/local/bin/ext-script $1 $2

Thus, all requests for PostgreSQL statistics are psql client queries wrapped into user parameters.
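For example, a user parameter that wraps a psql query might look like this (a minimal sketch; the key name and the query are my illustration, the real ones are in the repository's config file):

UserParameter=pgsql.uptime.example[*],psql $1 -qAtX -c "SELECT extract(epoch FROM now() - pg_postmaster_start_time())"

The $1 placeholder receives the psql connection options, so the same key can be pointed at different servers.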

Strengths:
  • minimal configuration requirements on the observed node: in the simplest case it comes down to adding the config and restarting the Zabbix agent (a more complex case involves adjusting access rights to PostgreSQL);
  • the parameters for connecting to PostgreSQL, as well as the threshold values for triggers, are set via macro variables in the web interface — so there is no need to dig into the triggers or clone the template to get different thresholds on different hosts (macros can be assigned per host);
  • a wide range of collected data (connections, transaction times, database and table statistics, streaming replication, etc.);
  • low-level discovery for databases, tables, and standby servers.

Weaknesses:
  • many parameters are observed; someone may want to disable some of them;
  • depending on the PostgreSQL version, some things will not work. In particular, this concerns replication monitoring, since some features are simply absent in older versions. Everything was written with version 9.2 and above in mind;
  • for some things you must have the pg_buffercache and pg_stat_statements extensions installed — if an extension is not installed, the corresponding parameters will not be available for observation.

Monitoring:
  • buffer allocation and writes, checkpoints and time spent writing during checkpoints — pg_stat_bgwriter;
  • general information on shared buffers — this requires the pg_buffercache extension. Note that queries against these statistics are resource-intensive, as the extension's documentation points out, so depending on your needs you can either increase the polling interval or disable these parameters entirely;
  • client connections and execution time of queries/transactions — pg_stat_activity;
  • database sizes and summary statistics for all databases (commits/rollbacks, reads/writes, temporary files) — pg_stat_database (an example query is shown after this list);
  • per-table statistics (reads/writes, number of maintenance tasks such as vacuum/analyze) — pg_stat_user_tables, pg_statio_user_tables;
  • streaming replication (server status, number of replicas, their lag) — pg_stat_replication;
  • other things (number of rows in a table, existence of a trigger, configuration settings, WAL logs).
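To give a feel for what these queries look like, here is a typical query against pg_stat_database (my own illustration; the exact queries used by the template are in the repository's config file):

SELECT datname, xact_commit, xact_rollback, temp_files, temp_bytes FROM pg_stat_database;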

Additionally, note that to collect this statistical data the following parameters must be enabled in postgresql.conf:
track_activities — enables tracking of the commands (queries/statements) of all client processes;
track_counts — enables the collection of statistics on tables and indexes;
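In postgresql.conf this looks as follows (both parameters are on by default in recent PostgreSQL versions):

track_activities = on
track_counts = on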

Installation and configuration.
Everything needed for the setup is in the GitHub repository.
# git clone https://github.com/lesovsky/zabbix-extensions/
# cp zabbix-extensions/files/postgresql/postgresql.conf /etc/zabbix/zabbix_agentd.d/

Next, note that for the agent to be able to run the queries, the corresponding access must be defined in the pg_hba configuration — the agent must be able to establish a connection to the postgres service in the target database and execute the queries. In the simplest case, you need to add the following line to pg_hba.conf (the file location may vary between distributions), allowing connections as the postgres user to the mydb database from localhost:
host mydb postgres 127.0.0.1/32 trust

Don't forget that after changing pg_hba.conf the postgresql service needs a reload (pg_ctl reload). Keep in mind that this is the simplest and not a very secure option, so if you want to use a password or a more complex access scheme, refer to the documentation on pg_hba.conf and .pgpass.
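For example, a password-based variant might look like this (a sketch; the zabbix role and the password are hypothetical). In pg_hba.conf, require a password:

host mydb zabbix 127.0.0.1/32 md5

and store the password in the .pgpass file in the home directory of the user the agent runs as (the file must have 0600 permissions):

127.0.0.1:5432:mydb:zabbix:secret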

So, the configuration file is copied; it remains to load it into the main configuration. Make sure the main agent configuration file contains an Include line pointing to the directory with additional configuration files. Now restart the agent, after which you can verify that everything works with a very simple check — pgsql.ping. In square brackets, specify the connection options for postgres, which will be passed to the psql client:
# systemctl restart zabbix-agent.service
# zabbix_get -s 127.0.0.1 -k pgsql.ping['-h 127.0.0.1 -p 5432 -U postgres -d mydb']

If access is set up correctly, the service response time in milliseconds will be returned. If an empty string is returned, there is a problem with access in pg_hba. If the string ZBX_NOTSUPPORTED is returned, the configuration was not loaded — check the agent config, the path in Include, and that the config file is in the right place.
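If in doubt, you can reproduce the agent's query manually under the same OS user the agent runs as (assuming that user is zabbix):

# sudo -u zabbix psql -h 127.0.0.1 -p 5432 -U postgres -d mydb -c "SELECT 1"

If this command fails, the problem is on the PostgreSQL access side rather than in the Zabbix configuration.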

Once the check command returns a correct answer, download the template, import it into the web interface, and assign it to the target host. The template is in the repository (postgresql-extended-template.xml). After importing, go to the template's macros tab and adjust them.
Below is the list with a brief description:
  • PG_CONNINFO — the connection parameters that are passed to the psql client when a query is run. This is the most important macro, since it defines the connection to the postgres service. The default string is more or less universal for most cases, but if you have multiple servers and each has different connection settings, you can define a macro with the same name on a host in Zabbix and give it an individual value — when a check runs, a host macro takes priority over a template macro (see the example after this list);
  • PG_CACHE_HIT_RATIO — threshold for the trigger on the percentage of successful cache hits; the trigger fires if the hit ratio is below this level;
  • PG_CHECKPOINTS_REQ_THRESHOLD — threshold for on-demand (requested) checkpoints;
  • PG_CONFLICTS_THRESHOLD — threshold for conflicts arising when queries run on a standby server;
  • PG_CONN_IDLE_IN_TRANSACTION — threshold for connections that have opened a transaction and are doing nothing in it (bad transactions);
  • PG_CONN_TOTAL_PCT — threshold for the percentage of open connections relative to the maximum possible number of connections (whether all connections are used up);
  • PG_CONN_WAITING — threshold for blocked queries waiting for other queries to complete;
  • PG_DATABASE_SIZE_THRESHOLD — threshold for database size;
  • PG_DEADLOCKS_THRESHOLD — threshold for deadlocks (fortunately they are resolved automatically, but it is desirable to know about them, since they are direct evidence of poorly written code);
  • PG_LONG_QUERY_THRESHOLD — threshold for query execution time; the trigger fires if there is a query that has been running longer than this level;
  • PG_PING_THRESHOLD_MS — threshold for the service response time;
  • PG_SR_LAG_BYTE — threshold for replication lag in bytes;
  • PG_SR_LAG_SEC — threshold for replication lag in seconds;
  • PG_UPTIME_THRESHOLD — threshold for uptime; if the uptime is below this mark, the service has been restarted.
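For example, an individual host-level value for the connection macro might look like this (the {$...} form is Zabbix's standard macro syntax; the exact default value is defined in the template):

{$PG_CONNINFO} = -h 127.0.0.1 -p 5432 -U postgres -d mydb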

From the trigger names it should be clear why these thresholds are needed (a sketch of one such trigger expression follows the list):
  • PostgreSQL active transaction too long — a long-running query or transaction was detected;
  • PostgreSQL cache hit ratio too low — the cache hit percentage is too low;
  • PostgreSQL deadlock occurred — a deadlock was detected;
  • PostgreSQL idle in transaction connections too high — too many connections in the idle in transaction state;
  • PostgreSQL idle transaction too long — a long transaction in the idle in transaction state was detected;
  • PostgreSQL number of waiting connections too high — a waiting query or transaction was detected;
  • PostgreSQL recovery conflict occurred — a conflict was detected during recovery on a replica;
  • PostgreSQL required checkpoints occur too frequently — checkpoints happen too often;
  • PostgreSQL response too long — the response takes too long;
  • PostgreSQL service not running — the service is not running;
  • PostgreSQL service was restarted — the service was restarted;
  • PostgreSQL total number of connections too high — the total number of connections is too large and approaching max_connections;
  • PostgreSQL waiting transaction too long — a query or transaction has been waiting for too long;
  • PostgreSQL database {#DBNAME} too large — the database size is too big;
  • PostgreSQL streaming lag between {HOSTNAME} and {#HOTSTANDBY} too high — the replication lag between the servers is too big.
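For reference, a trigger expression using such a macro looks roughly like this in Zabbix's classic syntax (the template and key names here are illustrative):

{PostgreSQL:pgsql.ping[{$PG_CONNINFO}].last()}>{$PG_PING_THRESHOLD_MS}

That is, the trigger fires when the last measured response time exceeds the threshold macro.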

Low-level discovery rules:
  • PostgreSQL databases discovery — discovery of existing databases, with the ability to filter by regular expression. On discovery, a graph with the database size is added (the JSON format such discovery checks return is sketched after this list);
  • PostgreSQL database tables discovery — discovery of tables in the monitored database, with the ability to filter by regular expression. Be careful with the filter and add only the tables that really interest you, because this rule creates 21 items for every table. On discovery, graphs for size, scans, row changes, and read statistics are added;
  • PostgreSQL streaming stand-by discovery — discovery of connected replicas. On discovery, a graph with the replication lag is added.
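For reference, a low-level discovery check returns the discovered entities as JSON in Zabbix's standard LLD format, roughly like this (the database names are illustrative):

{"data":[{"{#DBNAME}":"postgres"},{"{#DBNAME}":"mydb"}]}

Zabbix then substitutes each {#DBNAME} value into the item, trigger, and graph prototypes.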

As for the graphs, I tried to group the observed parameters without overloading individual graphs with an excessively large number of items. Thus, the information from pg_stat_user_tables is spread across 4 graphs.
  • PostgreSQL bgwriter — general information about what is happening to the buffers (how many are allocated, how many are written and how);
  • PostgreSQL buffers — general information on shared buffers (how many buffers there are, how many are used, how many are dirty);
  • PostgreSQL checkpoints — information on how checkpoints are going;
  • PostgreSQL service response — the service response time and the average query execution time;
  • PostgreSQL summary db stats: block hit/read — reads from the cache and from disk;
  • PostgreSQL summary db stats: events — events in the databases (deadlocks, conflicts, commits, rollbacks);
  • PostgreSQL summary db stats: temp files — information on temporary files;
  • PostgreSQL summary db stats: tuples — general information on row changes;
  • PostgreSQL transactions — execution time of queries;
  • PostgreSQL uptime — the uptime and the percentage of cache hits;
  • PostgreSQL write-ahead log — information on the WAL (volume of records written and number of files);
  • PostgreSQL database {#DBNAME} size — how the database size changes;
  • PostgreSQL table {#TABLENAME} maintenance — table maintenance operations (autovacuum, autoanalyze, vacuum, analyze);
  • PostgreSQL table {#TABLENAME} read stats — read statistics from the cache and from disk;
  • PostgreSQL table {#TABLENAME} rows — row changes;
  • PostgreSQL table {#TABLENAME} scans — scan information (sequential/index scans);
  • PostgreSQL table {#TABLENAME} size — the size of the table and its indexes;
  • PostgreSQL streaming replication lag with {#HOTSTANDBY} — the replication lag between the server and the replica.

In conclusion, a few example graphs.
The first one shows that temporary files are created regularly — you should look for the culprit in the logs and review work_mem.
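If you need to track down the source of temporary files, a related setting is log_temp_files: setting it to 0 in postgresql.conf logs the creation of every temporary file, and the offending query usually appears alongside in the log.

log_temp_files = 0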

The next one shows the ongoing events in the database — commits/rollbacks and conflicts/deadlocks; on the whole, everything is fine here.

The next one shows the state of streaming replication from one of the servers — the lag in seconds and in bytes.
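For reference, the lag in bytes can be obtained on the master with a query like this (function names are for 9.2+; my illustration, not necessarily the template's exact query):

SELECT client_addr, pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes FROM pg_stat_replication;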

And the final graph — the service response time and the average query time.

That's all, thank you for your attention!
Article based on information from habrahabr.ru
