Modelling Big Data for serendipity

Last Friday we had a screening of the TPBAFK movie here in Amsterdam, because it is nicer to watch together and have a talk afterwards. So we did, and afterwards I had a discussion with Erwin Blom and Lex Slaghuis on the impact of big data on personal experiences. I kept thinking about the subject after the discussion, and since my thoughts were hard to share in a few tweets, I promised to write them up as a follow-up.

One of the points Erwin made was that profiling people based on big data could lead to misunderstanding them. He is always advised to listen to U2/Pearl Jam because he likes Depeche Mode/Nirvana, while he hates U2/Pearl Jam (I’m not sure of the exact band names anymore, sorry about that. Please correct me :-). This is a danger, just like the misinterpretation of behavior in e-shops when you are buying for someone else and the collaborative filtering then recommends comparable titles to you, especially if this is stored in a profile that lasts longer than the session.

An extreme result of this kind of profiling was shown in the talk by Tinkebell at TEDxAmsterdam, where she showed how her online identity is now totally defined by people who misjudged her art pieces. And because our digital identity is merged with our real-life identity, this is a serious problem.

So I agree that there is a danger in profiling based on (big) data. But I think that, in the end, this is not a result of the principles of (big) data, but is caused by bad use of the data and bad design of the profiling system. The way to build good data-based profiling and personal user experiences is to create several layers of data intelligence.

Big Data is used mainly to create a layer of related objects. In the music example, the way people use songs (listening) can be combined into a relational system in which likely relations are defined. It is more about the expected chance of a relation than about hard relations, though. It is a loose coupling that can never be connected to people. To be clear: every new entry of listening data makes the system smarter.

The hardest part is that the data model should not be built on majorities, but on single combinations. The relations are only used for serendipity, for suggesting new stuff to users.
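To make this first layer a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class name `RelationLayer`, its methods and the simple co-occurrence counting are hypothetical, not a description of an existing system. It keeps every single combination it has seen, however rare, and it never stores who listened.

```python
from collections import defaultdict
from itertools import combinations


class RelationLayer:
    """Layer 1 (hypothetical sketch): loose, anonymous relations between songs."""

    def __init__(self):
        # pair_counts[a][b] = number of listening sets that contained both a and b
        self.pair_counts = defaultdict(lambda: defaultdict(int))

    def add_listening_set(self, songs):
        """Record one set of songs that were listened to together.

        No user id is passed in or stored, so a relation cannot be
        connected to a person afterwards.
        """
        for a, b in combinations(sorted(set(songs)), 2):
            self.pair_counts[a][b] += 1
            self.pair_counts[b][a] += 1

    def related(self, song):
        """Return {other_song: estimated chance} for everything seen with `song`.

        Rare combinations are kept as well, with a low chance, instead of
        being dropped in favour of the majority.
        """
        counts = self.pair_counts.get(song, {})
        total = sum(counts.values())
        if total == 0:
            return {}
        return {other: n / total for other, n in counts.items()}
```

Feeding it the sets `["Nirvana", "Pearl Jam"]`, `["Nirvana", "Soundgarden"]` and again `["Nirvana", "Pearl Jam"]` would give Pearl Jam a chance of two thirds and Soundgarden one third as relations of Nirvana. These are expected chances, not hard relations.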

There we enter the second layer, that of the user. The dataset of objects in this layer is not based on relations between the items; every object stands on its own. The suggestions are not baked into the user profile; they are generated in every session, based on the relations in layer 1.
The result is that the profile of a user does not consist of advised objects, and is never defined by the machine. The combination of users in layer 2 and objects in layer 1 is always event-based.
This also works for building relations between people. A good recommendation system does not try to connect the profile of one user to another profile; it should actively search for similarities in the first layer, triggered by the profiles of people.
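A small sketch of how this second layer could look, again in Python and again under assumptions of my own: `UserProfile`, `suggest_for_session` and the `RELATIONS` dictionary (a stand-in for layer 1 with made-up numbers) are hypothetical names. The point it tries to show is that the profile only holds the user's own objects, and that suggestions are computed per session and never written back.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for layer 1: object -> {related object: estimated chance}.
# The numbers are invented for illustration only.
RELATIONS = {
    "Nirvana": {"Pearl Jam": 0.6, "Soundgarden": 0.4},
    "Depeche Mode": {"U2": 0.7, "New Order": 0.3},
}


@dataclass
class UserProfile:
    """Layer 2: only the objects the user chose; no relations, no suggestions."""
    name: str
    objects: set = field(default_factory=set)


def suggest_for_session(profile, relations, limit=3):
    """Compute suggestions for one session without touching the profile."""
    scores = {}
    for owned in profile.objects:
        for other, chance in relations.get(owned, {}).items():
            if other not in profile.objects:
                # Keep the strongest relation found for each candidate.
                scores[other] = max(scores.get(other, 0.0), chance)
    # Best first; ties are broken alphabetically to keep the order stable.
    return sorted(scores, key=lambda obj: (-scores[obj], obj))[:limit]
```

For a profile that contains only Nirvana this returns Pearl Jam and Soundgarden, and the profile itself still contains only Nirvana afterwards. Taking the maximum chance per candidate is just one possible choice; summing or weighting would fit the same idea.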

A third layer is the stored intelligence. In layer 2 the objects of the user are stored; in layer 3 the behaviour is stored. Certain reactions to suggestions, especially those at the high and low ends, are kept to prevent repeated questions. All known reactions are stored, so if Erwin has indicated once that he does not like U2, it will not be presented to him anymore.
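As an illustration of this third layer, a minimal Python sketch could look like the one below. The class `ReactionStore`, its method names and the two reaction values `"like"` and `"dislike"` are assumptions made for the example; a real system would probably store richer behaviour.

```python
class ReactionStore:
    """Layer 3 (hypothetical sketch): remembers reactions to suggestions."""

    ALLOWED = ("like", "dislike")

    def __init__(self):
        # (user, object) -> last known reaction
        self._reactions = {}

    def record(self, user, obj, reaction):
        """Store the reaction of a user to a suggested object."""
        if reaction not in self.ALLOWED:
            raise ValueError(f"unknown reaction: {reaction!r}")
        self._reactions[(user, obj)] = reaction

    def rejected(self, user):
        """Return the set of objects this user has said they do not like."""
        return {obj for (u, obj), r in self._reactions.items()
                if u == user and r == "dislike"}

    def filter_suggestions(self, user, suggestions):
        """Drop suggestions the user has already rejected, keeping the order."""
        rejected = self.rejected(user)
        return [s for s in suggestions if s not in rejected]
```

Once `record("erwin", "U2", "dislike")` has been called, `filter_suggestions("erwin", ["U2", "New Order"])` only returns New Order, while another user would still get both.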

The last element is the active engine that connects the different layers. That engine should be smart too. The engine connects objects and prevents the same suggestion from being presented twice. At the same time it needs to create serendipity. It should ask for confirmation once in a while by breaking the rules of the third layer.
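The engine could be sketched roughly as follows. This is a simplification under my own assumptions: the function `pick_suggestions` and the parameter `recheck_chance` (the probability with which a stored rejection is deliberately ignored to ask for confirmation again) are hypothetical, and 5% is an arbitrary default.

```python
import random


def pick_suggestions(candidates, rejected, recheck_chance=0.05, rng=None):
    """Combine layer-1 candidates with layer-3 rejections for one session.

    candidates     -- suggested objects, best first (may contain duplicates)
    rejected       -- set of objects the user rejected earlier (layer 3)
    recheck_chance -- assumed probability of re-asking about a rejected object
    rng            -- optional random.Random instance, useful for testing
    """
    rng = rng or random.Random()
    result = []
    seen = set()
    for obj in candidates:
        if obj in seen:
            continue  # never present the same suggestion twice
        seen.add(obj)
        if obj in rejected and rng.random() >= recheck_chance:
            continue  # respect the stored rejection most of the time
        result.append(obj)
    return result
```

With `recheck_chance=0.0` the rules of the third layer are always followed; a small positive value lets a rejected object slip through once in a while, so the user can confirm or revise an old reaction.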

This approach of course asks for more activity in the system, and performance is always a tough nut to crack. With Big Data products like Hadoop and its supporting tools, it could be possible to build this kind of continuous ad-hoc profiling.

The most important characteristic of this Big Data for Tiny Services is the humbleness of the system towards the user. The system should be designed as a dialogue: a system that is used by the user, not one that steers the use.