Visualizing data processing tools from GitHub

Are you using MySQL, Postgres, or Mongo, and maybe even Apache Spark? Want to know how these projects started and where they are heading now? In this article I will present a visualization that shows exactly that.




Today there is a huge variety of data processing tools for every taste, from classic relational databases to new-fangled tools for processing streams of events in real time. Open-source projects enjoy particular popularity and love among developers. Any problem encountered in such a technology will not be buried in the sand by a vendor like Oracle (which, of course, will provide you with a workaround for the workaround), but will be discussed openly and ultimately fixed. And you can not only report a problem but actually fix it yourself and offer your code to the community.

Almost all open-source projects keep their source code on GitHub: there is either the main repository or at least an up-to-date mirror. The version control system holds a wealth of information about every change ever made to the source code of every project stored there. What if we analyze that information?

For the analysis, I took a list of the most popular data processing tools and analyzed their repositories. Then, for each user who made changes to those tools, I sampled their latest activity on GitHub and collected the most popular repositories to which they had contributed. As a result, the list of repositories ranked in the visualization is not limited to those I personally know and reflects the real situation in the community more objectively. That is how projects like Node.js, Docker, and Kubernetes ended up in the visualization, even though their relationship to data processing is rather indirect.
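For illustration, here is a minimal sketch of how that collection step could look against the GitHub REST API. The seed list and helper names are my own assumptions, and pagination, authentication, and rate-limit handling are deliberately omitted; this is not the author's actual code:

```python
import requests
from collections import Counter

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add a token for real-world volumes

def contributors(repo):
    """People who committed to the given repo (first page only, for brevity)."""
    resp = requests.get(f"{API}/repos/{repo}/contributors", headers=HEADERS)
    resp.raise_for_status()
    return [c["login"] for c in resp.json()]

def recent_push_repos(login):
    """Repos the user recently pushed to, taken from their public event feed."""
    resp = requests.get(f"{API}/users/{login}/events/public", headers=HEADERS)
    resp.raise_for_status()
    return [e["repo"]["name"] for e in resp.json() if e["type"] == "PushEvent"]

# Hypothetical seed list of popular data processing tools.
seed = ["apache/spark", "postgres/postgres", "mongodb/mongo"]

popularity = Counter()
for repo in seed:
    for login in contributors(repo):
        # set() ensures each user votes for a repository at most once,
        # which is what ranking by unique contributors requires.
        popularity.update(set(recent_push_repos(login)))

print(popularity.most_common(20))
```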

In essence, this visualization has a remarkable property: it is an absolutely independent view of open-source data processing projects, free of any marketing. After all, it analyzes real changes to real code by real people. In my opinion that is great, because vendors promoting their products have greatly undermined the credibility of various analyst publications.

In total, this work analyzed 150 GitHub repositories and more than half a million commits made by 8,333 unique developers. Python was used for data extraction, parsing, and communicating with the GitHub API; Postgres for data storage and analysis; and Matplotlib for the visualization. The rendering itself was written by hand, including the motion algorithm for the graph's vertices (in short, a vertex is attracted to the vertices connected to it and repelled from the nearby ones).
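The article does not include the layout code, but a naive force-directed step of the kind described (attraction along edges, repulsion between vertices that fades with distance) could be sketched like this; all constants and names here are illustrative, not the author's implementation:

```python
import math
import random

def layout_step(pos, edges, attract=0.01, repel=0.5, min_dist=1e-3):
    """One iteration of a naive force-directed layout.

    pos   -- dict mapping node -> [x, y]
    edges -- iterable of (node_a, node_b) pairs
    """
    forces = {n: [0.0, 0.0] for n in pos}

    # Repulsion: every pair of vertices pushes apart, fading with distance.
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = max(math.hypot(dx, dy), min_dist)
            f = repel / (d * d)
            forces[a][0] += f * dx / d; forces[a][1] += f * dy / d
            forces[b][0] -= f * dx / d; forces[b][1] -= f * dy / d

    # Attraction: connected vertices pull together like springs.
    for a, b in edges:
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        forces[a][0] += attract * dx; forces[a][1] += attract * dy
        forces[b][0] -= attract * dx; forces[b][1] -= attract * dy

    # Move every vertex along its accumulated force.
    for n in pos:
        pos[n][0] += forces[n][0]
        pos[n][1] += forces[n][1]

# Toy usage: four nodes in a chain settle into a roughly even spacing.
pos = {n: [random.random(), random.random()] for n in "ABCD"}
for _ in range(200):
    layout_step(pos, [("A", "B"), ("B", "C"), ("C", "D")])
```

Here is the visualization: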



I recommend watching it in the highest quality available so that all the project names are legible.

Each node of the graph represents one project. The node's area is proportional to the number of unique people who made changes to that project during the 10 weeks preceding the current moment of the visualization (see the scale at the top). The size of the project's name also depends on the number of unique contributors: large yellow text for the largest projects, smaller text for smaller projects, and for the smallest projects the name is not displayed at all. An edge is drawn between projects A and B if at least one person made changes to both projects during the 10 weeks preceding the current moment of the visualization. This seemed reasonable to me because it ties together forks of the same technology as well as simply related projects, such as Apache Hadoop and Apache HBase.
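To make the edge criterion concrete, here is a small sketch of deriving the edges from flat commit records; the (author, repo, week) tuple format is my assumption about how such data might be laid out after extraction:

```python
from collections import defaultdict
from itertools import combinations

def edges_for_window(commits, window_end, window_weeks=10):
    """Project pairs sharing at least one contributor in the
    window_weeks weeks ending at window_end (inclusive)."""
    repos_by_author = defaultdict(set)
    for author, repo, week in commits:
        if window_end - window_weeks < week <= window_end:
            repos_by_author[author].add(repo)

    edges = set()
    for repos in repos_by_author.values():
        # Every pair of repos touched by the same person becomes an edge.
        edges.update(combinations(sorted(repos), 2))
    return edges

# Toy usage: one shared contributor links Hadoop and HBase.
commits = [
    ("alice", "apache/hadoop", 98),
    ("alice", "apache/hbase", 100),
    ("bob", "postgres/postgres", 99),
]
print(edges_for_window(commits, window_end=100))
# {('apache/hadoop', 'apache/hbase')}
```

Inverting the same index (repository to set of authors) would also give the node sizes: the number of distinct contributors per repository in the window.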
My original publication is available here.
