MilkyWeb — Graph of Everything



In this article I want to share my thoughts on ways to solve some fundamental problems of the modern Internet: to describe a model that, in my opinion, can help organize knowledge on the Internet even better, and to present my attempt to implement such a model.


Intro

Social networks and search engines try to organize as much information as possible about the world, and about the user in particular.
In computer science, the basis for describing any subject area is an ontology, or its simplified variant, a graph. With their help a knowledge base can be described in a way that is, perhaps, the least ambiguous for a computer.

A large number of specialized graphs have already been created in the existing web space: Facebook, LinkedIn, Foursquare, etc.
As you know, Google is expanding its Knowledge Graph and actively using it in search.

The problem is that the world contains an endless number of subject areas, and to create a new graph it is customary to create a new social network.

The MilkyWeb project (MW, MilkyWeb), which I want to present here, is an attempt to create a universal tool for describing any subject area (creating any graph) in one place.
In other words, it is an attempt to create a universal social network: a knowledge base of everything.

The project has not yet left the alpha stage, so the interface is poor; please don't be angry. The site works properly only in Chrome: I decided not to spend time on cross-browser support yet, so I apologize to users of other browsers.


Ideology



The main ideology of the project is the ontology model, a mathematical representation of knowledge.
It rests on three pillars: concepts, individuals, and predicates.

Concepts are abstract notions about the world. Roughly speaking, these are the generalized (collective) names of the things and phenomena that surround us.
Concepts form a hierarchy: for example, the concept "Programmer" derives from the concept "Person," which in turn derives from "Being."
You can draw an analogy with programming: concepts in an ontology are like classes in OOP.
Concepts come in two kinds: abstract concepts and sets (or ancestors, from the English "ancestor," i.e. forefather). The concept "friendship" is abstract, while "car" is the name of a set of real objects.

Individuals are the objects that surround us in the real world. Every individual is an implementation of at least one ancestor concept. In the context of the OOP analogy, individuals are instances of classes.
For example, the object "Albert Einstein" is an individual of the concept "Scientist." Inheritance is naturally supported: since a "Scientist" is a "Person," "Albert Einstein" is also a "Person."
When a new user creates an account in MW, this in fact means creating a new individual of the concept "Person" in the ontology.
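The OOP analogy above can be sketched in a few lines. This is only an illustration with hypothetical names (`Concept`, `Individual`, `is_a`), not the project's actual code (the site's core is Java + MySQL): a concept keeps a reference to its ancestor, and an individual answers "is a" questions by walking up that chain.

```python
class Concept:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # ancestor concept, or None for a root

    def ancestors(self):
        """Yield this concept and every concept it derives from."""
        node = self
        while node is not None:
            yield node
            node = node.parent


class Individual:
    def __init__(self, name, concept):
        self.name = name
        self.concept = concept  # the ancestor concept this individual implements

    def is_a(self, concept):
        """True if the individual's concept inherits from `concept`."""
        return concept in self.concept.ancestors()


person = Concept("Person")
scientist = Concept("Scientist", parent=person)
einstein = Individual("Albert Einstein", scientist)

print(einstein.is_a(scientist))  # True
print(einstein.is_a(person))     # True: a Scientist is also a Person
```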

In graph terms, concepts and individuals are the vertices of the graph, while the edges (or arcs) are predicates.

Predicates are the properties by which graph vertices are connected.
A simple example of a predicate, as many have probably guessed, is the friendship relation on FB or VK.



The relationship depicted in the figure above is called a triplet, as it involves three constituents: the subject "Richard Feynman" (a vertex of the graph), the predicate "Born in" (an arc of the graph), and the object "New York" (a vertex of the graph).
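In its simplest form such a graph is just a set of (subject, predicate, object) triplets; a minimal sketch with illustrative data:

```python
# A triplet is an edge in the graph: (subject, predicate, object).
triples = set()

def add_triple(subject, predicate, obj):
    triples.add((subject, predicate, obj))

add_triple("Richard Feynman", "born in", "New York")

# Query: where was Feynman born?
places = [o for (s, p, o) in triples
          if s == "Richard Feynman" and p == "born in"]
print(places)  # ['New York']
```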

In fact, the whole goal of the MilkyWeb project boils down to this: a user should be able to create a page for any object of the surrounding world (a concept or an individual) and link it to other pages in a semantically correct way (using predicates).
Each predicate is created in conjunction with one or more concepts.
For example, the predicates "friend" or "mother" can connect only individuals of the concept "Person," while the predicate "CEO" can connect a "Person" and a "Company."

There are also literal predicates. Such "literal" predicates point not at a graph vertex but at a value. Every literal has a type, e.g. string, integer, date, geographic coordinates, etc. (currently only URL literals are supported).

Concepts and predicates are the framework of any ontology, that is, the template on which the entire graph is built, so for the moment these entities can be created only by the site administration. This process involves not just creating the entities themselves but also configuring what the user sees.
For example, every predicate has a threshold on the maximum number of triplets: an individual can have only one triplet with the predicate "mother," but many with the predicate "friend."
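One way to picture these constraints, as a sketch with hypothetical names (the real configuration format is not described here): each predicate declares which concepts it may connect and a cap on the number of triplets per subject. For simplicity the check compares concepts by name and ignores inheritance.

```python
PREDICATES = {
    # name: (subject concept, object concept, max triplets per subject)
    "mother": ("Person", "Person", 1),
    "friend": ("Person", "Person", None),  # None = unlimited
    "CEO":    ("Person", "Company", 1),
}

def can_add(store, subject, predicate, subj_concept, obj_concept):
    """Check a new triplet against the predicate's configuration."""
    dom, rng, cap = PREDICATES[predicate]
    if subj_concept != dom or obj_concept != rng:
        return False  # semantically invalid link
    if cap is not None:
        existing = sum(1 for (s, p, _) in store
                       if s == subject and p == predicate)
        if existing >= cap:
            return False  # threshold reached (e.g. only one "mother")
    return True

store = {("Alice", "mother", "Carol")}
print(can_add(store, "Alice", "mother", "Person", "Person"))  # False: cap of 1
print(can_add(store, "Alice", "friend", "Person", "Person"))  # True
```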

Example

As I said, the administration creates the skeleton of the ontology, and users fill it in.
Here is an example of filling in a subject area based on the concept "Film."

The administrator creates the concept "Film" and sets up the necessary predicates, such as "starring," "director," "producer," "premiere," "country," "favorite movie," and "watched."

The user FOO creates a page (an individual) "Pirates of the Caribbean" based on the concept "Film" and begins to "describe" it.
Using the predicate "starring," he indicates that the individuals "Johnny Depp" and "Keira Knightley" appear in the film.
He then links the page to its producer, director, and country.
With the literal "premiere," the user specifies that the film premiered on June 28, 2003.

Okay, the basic data about the film has been entered, but what next? Next, FOO can specify that "Pirates of the Caribbean" is his "favorite movie."
Meanwhile the user GOO, who follows FOO, is bored in front of his monitor and sees in his feed the triplet that FOO just created. He takes it as a call to action and decides to download the movie from torrents, or rather buy the DVD, and then watch it! Having gotten a taste of the film, he creates the triplet "watched" "Pirates of the Caribbean," telling the world about a small part of his life with one click!

I did not choose the subject area "films" for the example by accident: Facebook engineers are working on structuring exactly such moments of people's lives. Read more: www.wired.com/business/2012/11/mike-vernal-facebook

I also want to note that the predicate "favorite movie" and the "Like" button on a movie page on IMDB are not the same thing. The semantics of a like is very blurry and does not let us say what exactly the user meant when "liking" a particular page.

This structure greatly simplifies describing any particular subject area. Where Facebook has a fixed set of templates for generating pages, in the system described above templates can be created on the fly. If at some point we decide to launch a new social network, we only need to create the set of concepts and predicates specific to that domain.

At the moment all generated pages support English only (keep this in mind when searching). A localization mechanism for other languages is planned.


Sharing and the challenges of Big Data

I could not find a well-established Russian expression for "sharing" in the sense of "to share information" or "to disseminate information," so I left the term untranslated in the title.

Lately it has become customary to use the term Big Data for the area characterized by rapid growth in the amount of information. The term implies a problem a priori: the pace of data generation is so high that the most valuable information can get lost in the general stream. For information not to get lost, it must be structured and classified.
As practice shows, building a news feed from posts "from friends" is not the best option. More precisely, this method is good for learning about the people around you, but not about interesting things in general.
As a consequence, the Facebook news feed is littered with cat pictures and quotations from "great" people. You can try subscribing to a themed public page, but that does not guarantee delivery to your feed of everything generated at a given moment that could interest you.
Facebook builds crutch upon crutch to deliver only the most relevant information to the user's feed. To some extent it succeeds, but the algorithm for constructing the news feed is based on user actions (analysis of likes, comments, etc.), so this, too, is not universal.

In my opinion, the most successful approach, the "come in, learn everything relevant, leave" model, comes from Twitter and Hacker News.
So I originally tried to make the mechanics of information dissemination in MilkyWeb something in between T and HN. That is, a user visits the site and gets all the information that could interest him from the last interval X.
And not only from the pages he is subscribed to (Twitter, FB, VK), but also from thematic streams (HN).

In MW you can share text (up to 2000 characters), links, and videos (YouTube). No photos yet: they are expensive to store.

How can a user share information, and who will receive it?

The user can:

  1. Post messages to his own page.
     In this case the message reaches the users who follow the sender.
     Note that if user A has created at least one triplet with user B, then A follows B.
  2. Post messages to another user's page.
     Obviously, the message reaches only the intended recipient.
  3. Post messages on the pages of individuals that are not users.
     The message reaches everyone who follows that individual.
  4. Broadcast messages into thematic streams.
     Thematic streams are the counterpart of the above for concepts, i.e. you can post a message on the "Programming" page. In this case the message reaches everyone subscribed to "Programming" as well as all users who "inherit" the concept "Programmer."
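The four delivery rules above can be sketched as a single routing function. The names and data are illustrative, and rule 4's set of "inheriting users" is precomputed here rather than derived from the concept hierarchy:

```python
followers = {                 # who follows a user or a page
    "FOO": {"GOO", "BAR"},
    "Pirates of the Caribbean": {"GOO"},
    "Programming": {"BAR"},
}
# users whose concept inherits from one linked to the stream
concept_members = {"Programming": {"FOO"}}

def recipients(target, kind):
    if kind == "own_page":      # rule 1: all of the sender's followers
        return set(followers.get(target, set()))
    if kind == "user_page":     # rule 2: only the addressee
        return {target}
    if kind == "entity_page":   # rule 3: everyone following the entity
        return set(followers.get(target, set()))
    if kind == "stream":        # rule 4: subscribers + inheriting users
        return followers.get(target, set()) | concept_members.get(target, set())
    raise ValueError(kind)

print(sorted(recipients("Programming", "stream")))  # ['BAR', 'FOO']
```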

The last two options are, in a sense, attempts to solve the problems of Big Data. The basic idea is this:
User X has information that thematically relates to some sphere of real life. He does not think about where to post it; he simply throws it into the general thematic stream for the relevant topic. The system's task is then, based on the actions of other users (e.g. ratings or reposts), to identify the most valuable data in the general flow.

Work on this mechanism is still in progress. There is no content ranking system yet, but it will be implemented in the near future, and I have ideas for using it to make the user's news stream more relevant than in other networks. It is the model described in the previous chapter that makes it possible to distinguish information semantically unambiguously and classify it properly.

Naturally, this approach can generate waves of spam. At the moment the site does not let you post more than one message every 20 seconds. In the future I will solve this issue more thoroughly. For now the task is to test the viability of the mechanics and identify possible critical spots.
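The current throttle ("no more than one message per 20 seconds") amounts to remembering each user's last accepted post; a minimal sketch, with timestamps passed explicitly for clarity:

```python
import time

INTERVAL = 20.0
_last_post = {}  # user id -> timestamp of the last accepted message

def try_post(user, now=None):
    """Accept the message only if 20 s have passed since the last one."""
    now = time.monotonic() if now is None else now
    last = _last_post.get(user)
    if last is not None and now - last < INTERVAL:
        return False  # rejected: too soon after the previous message
    _last_post[user] = now
    return True

print(try_post("FOO", now=0.0))   # True: first message
print(try_post("FOO", now=5.0))   # False: only 5 s elapsed
print(try_post("FOO", now=25.0))  # True: 25 s since the accepted post
```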

As the reader has probably guessed, this system holds great potential for distributing targeted content. You can make complex selections of the target audience. For example, send a message to everyone who "is a" "Programmer" and "lives in" "Moscow"; or to those who "bought" an "iPhone" and plan to "buy" an "iPad"; or to anyone who "drives" a Mercedes.
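Such audience selections are conjunctive pattern matches over the graph. A sketch with illustrative triples and a hypothetical `select` helper:

```python
triples = {
    ("FOO", "is a", "Programmer"),
    ("FOO", "lives in", "Moscow"),
    ("GOO", "is a", "Programmer"),
    ("GOO", "lives in", "Berlin"),
}

def select(store, *patterns):
    """Users that have a triple for every (predicate, object) pattern."""
    subjects = {s for (s, _, _) in store}
    return {s for s in subjects
            if all((s, p, o) in store for (p, o) in patterns)}

audience = select(triples, ("is a", "Programmer"), ("lives in", "Moscow"))
print(sorted(audience))  # ['FOO']
```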


Semantic web

The Semantic Web (SW) is a web space in which content created by humans is understood by computers.
This can be achieved by adding metadata to a web document (e.g. HTML). Metadata is already widely used on the web and plays a vital role in finding data, structuring it, and so on.
But for a search engine to "understand" the content of a particular page, the page must be accompanied by a separate document containing a computer-readable description (in the form of a graph) of the part of the world the original page refers to.

The specification requires these meta-documents to be written in RDF. The problem is that someone has to create these files and attach them to the HTML document.

This is exactly the problem I took up two years ago in the form of a thesis. The goal was to make a convenient, interactive tool for creating RDF descriptions with a centralized metadata repository, where they would accumulate without being duplicated.

Over time I deviated slightly from the original direction in favor of the social aspect. But already now you can obtain an RDF description of a particular entity at milkyweb.net/rdf/{ c | p | i }/entity_id. For example, the RDF documents for the individual "Moscow" and the concept "Person" live at milkyweb.net/rdf/i/10460 and milkyweb.net/rdf/c/10000 respectively (user info, of course, is not public).
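What such an RDF endpoint has to produce can be sketched as a tiny Turtle serializer. The prefix, entity ids, and predicate names below are illustrative assumptions, not the real output of milkyweb.net/rdf/:

```python
BASE = "http://milkyweb.net/"  # illustrative namespace, not the real one

def to_turtle(entity_id, triples):
    """Render (predicate, object) pairs for one subject as Turtle."""
    lines = [f"@prefix mw: <{BASE}> ."]
    for pred, obj in triples:
        lines.append(f"mw:{entity_id} mw:{pred} mw:{obj} .")
    return "\n".join(lines)

# e.g. a description of the individual "Moscow" (i10460); the concept
# and related-entity ids here are made up for the example
doc = to_turtle("i10460", [("instance_of", "c12345"),
                           ("capital_of", "i67890")])
print(doc)
```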

So all a webmaster has to do is attach the link for the desired object to a web page of his site. In the future a search engine will pick up the document at the specified URL and will be able to classify the content of the page, increasing the relevance of search results. Or you could watch in real time as content about a specific entity appears across the Internet. Pretty cool, you must agree! :)

For specialists in this field I will note that integration with existing vocabularies is planned.

Of course, I am simplifying greatly. One social network will not be enough to popularize the SW. Most likely, special frameworks for web developers will be needed to automate tagging content with metadata. But I believe that sooner or later the mechanism will start working, and the first step in this direction is the creation of a global knowledge base of the Internet.


Problem

The biggest problems I have faced lie in the ideology of the project and in the terminology of ontology as such.

All the W3C specifications of SW technologies (RDF, OWL) say that concepts, individuals, and predicates are enough to describe web ontologies, and for a while I believed it.

In the Russian Wikipedia you can find this description of the notion of a concept:
Concepts (classes) are abstract groups, collections, or sets of objects. They may include individuals, other classes, or a combination of both.

And then there is an example with a small digression:
The concept "people" includes the notion "person." Whether "person" is a sub-concept or an instance (individual) depends on the ontology.

This remark (in italics), inconspicuous at first glance, is a fundamental problem that has occupied philosophers of all times.

If we start to build a global ontology by these "classic rules," the whole structure collapses immediately, as I found out firsthand.
Initially I assumed that the concepts in my network are abstract notions, which may or may not "have" individuals.
And individuals, in turn, are real objects that you can touch with your hands and that "implement" some concepts.
But suppose we have the concept "Phone." Now we need to create an "iPhone" page. What is "iPhone": a concept or an individual? Suppose it is an individual. Then at some point the user FOO decides to create a personal page for the "iPhone" device lying in his pocket. Why? It doesn't matter; maybe he wants to put it up for sale. What matters is that if "iPhone" is an individual, a page for the specific device cannot be created, because we have capped the level of abstraction and the system ceases to be coherent.
Okay, let's assume that "iPhone" is a concept. But initially we decided that concepts are fundamental notions; they cannot come and go over time. That is, we are not going to create a separate concept in the hierarchy for every new product created by mankind.

So the idea that there are concepts and individuals is valid only within a pre-established framework, and this approach cannot be used to create a global ontology.

There are myriad pitfalls like this, and I think a universal way of describing the world can be created only through testing and revision.


Outro

I do not expect a quick impact from the project; as I said earlier, at the moment it is an experiment.
Many questions and problems remain open. Perhaps a global graph has no right to exist at all. Or perhaps the suggested approach is simply unsuitable for creating one.
The purpose of my work is to find, in a practical way, solutions to the fundamental problems of the global web.

Everything I have described above is just the tip of the iceberg of my findings and ideas. If the topic proves to be of interest, I will try to continue the series.

I will be grateful for feedback of any kind! You can write in the comments, in a private message, or via the form on the site (please post bugs and glitches there).
If these ideas seem interesting to someone and they would like to take part in developing the project, I am open to cooperation (the core of the site is Java + MySQL).

By development I mean not only programming but also filling the knowledge base. At the moment the network contains around 1000 entities in various subject areas, which is, of course, very few. If you cannot find the page of your city, country, favorite band, movie, etc., try creating that page and share your user experience.

P.S. Those who requested an invitation: don't be surprised if it does not arrive immediately. The SMTP server is our bottleneck. You can write to me in PM and I will send one.

Thank you for your attention!
Article based on information from habrahabr.ru
