Wednesday, August 19, 2009

Twitter Spam Trust Model

Some time ago I signed up for a Twitter account, as you may have read on this weblog. I started using Twitter just for fun and to find out what everyone was talking about. After a while I became quite happy with the service, the information that can be found on it, and the way you can interact with people you have never spoken to before and who might be out of reach if it were not for Twitter.

However, as with every good service, after some time it will also be used to promote goods and services you might not want, and you will be contacted in ways you can only consider spam. Twitter spam is, in my opinion, currently the biggest problem for and threat to Twitter and its growth. People who use the service do not want to be annoyed by all kinds of spam messages. Some time ago I posted a tweet stating that Twitter spam will be the next big fight. I got some reactions to this tweet, via Twitter and also offline. Some people said that if I considered this the next fight, I should make my point by thinking the subject through and proposing an approach for how Twitter should fight it.

As Twitter is just a message service from one person to one or more other persons, some of the approaches designed for fighting email spam can be applied. Some can even be applied more effectively, as all communication happens inside a single domain. For example, a trust model, already used for email, can very easily be applied to fight Twitter spam.

Trust model:
A trust model against Twitter spam should look at the relationship you, as a sender, have with the person you are sending the tweet to. A Tweet Spam Rank (TSR) could be calculated for each tweet: the higher the TSR, the lower the trust between the sender and the receiver. You can send a message to someone you have no relation with; this will give you a high TSR but will not make you a spammer. To prevent you from being banned as a spammer for sending a single message to someone you have no relation with, what counts is that your average TSR over time stays below the threshold for being identified as a spammer. Still, the TSR calculation plays a big role in fighting spam. Before explaining the TSR calculation, first some basics on the Twitter relation model and its components.

You, the sending party, are represented in the model as the green dot. As you can see, you can have several relations (or non-relations) with other hops. Hops are other Twitter users you send a message to, or who are a bridge to other hops. The model in its current version only goes two hops deep: at most a connection hop and a destination hop. To find out whether this is “deep” enough, one should run some calculations on the Twitter data.

As can be seen in the picture above, there are four types of connections that can be made:

- T1, a connection to a hop and a connection back. You follow the tweets of this person and this person in turn follows yours. As you both follow each other you most likely have a strong connection, so sending a message over this connection will result in a low TSR.

- T2, a connection from a remote hop to you. This person is following you and you do not follow them. For some reason this person is interested in you, so if you send a direct tweet to this person he or she will most likely be willing to accept it. It is not as strong as a double connection, but it still results in a low TSR.

- T3, a non-connection. You have no connection whatsoever to this person, not even via a connection hop, so this will result in a high TSR.

- T4, you follow a person but this person is not following you. You have an interest in this person's tweets, but the interest is not returned, so a direct tweet to this person will result in a higher TSR.
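
The four connection types can be told apart from nothing more than who follows whom. Below is a minimal sketch in Python, assuming the follow relations are available as a dictionary that maps each user to the set of users they follow. That data shape, and the names `Connection` and `classify`, are assumptions made for illustration, not anything Twitter provides. The sketch only looks at direct links, so it returns T3 whenever neither side follows the other; checking for a route through a connection hop is left out.

```python
from enum import Enum


class Connection(Enum):
    T1 = "both follow each other"
    T2 = "receiver follows sender"
    T3 = "no direct connection"
    T4 = "sender follows receiver"


def classify(sender, receiver, follows):
    """Classify the direct link between sender and receiver.

    follows: dict mapping a user to the set of users they follow
    (a hypothetical data shape, used here for illustration only).
    """
    sender_follows = receiver in follows.get(sender, set())
    receiver_follows = sender in follows.get(receiver, set())
    if sender_follows and receiver_follows:
        return Connection.T1
    if receiver_follows:
        return Connection.T2
    if sender_follows:
        return Connection.T4
    return Connection.T3


# Example follow graph: "you" and "a" follow each other,
# "b" follows "you", and "you" follow "d" without being followed back.
follows = {"you": {"a", "d"}, "a": {"you"}, "b": {"you"}}
print(classify("you", "a", follows).name)
print(classify("you", "b", follows).name)
```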

Now we have to connect some values to the parts of the trust model so we can calculate the TSR of a message. For this we refer to the model as it is shown below. As you can see all possible relations within the trust model are represented in this diagram.

We start with the sending party, which for calculation purposes has the value 2. A T1 connection has a value of 5, T2 a value of 10, T3 a value of 100 and T4 a value of 15. A connection hub has a value of 5.

Now let's say you want to send a message to the user in hop B. We can calculate the TSR as {you * T1}, which is {2*5}, so this message will have a TSR of 10, the lowest TSR you can get. You have just sent a message with a very low Tweet Spam Rank. However, sending a message to B1 gives a calculation like {you * T1 * connection-hub * T4}, which is {2*5*5*15}, meaning this message has a TSR of 750.

For example, you can be sending a message to C1. You have a very weak connection with D2, so you should get a high TSR: {you * T4 * connection-hub * T4}, which is {2*15*5*15} and results in a TSR of 2250. This is the weakest connection you can have through a connection hop: two T4 connections. The one exception to the rule is a T3 connection, which results in a TSR of 1000 without any calculation needing to be done.
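
The calculations above can be written down as a small function. This is a minimal sketch in Python, not a finished implementation: the names `path_tsr`, `WEIGHTS` and `T3_FIXED_TSR` are made up for illustration, and it assumes a T3 connection always gets the fixed score of 1000 rather than using its weight of 100 in a multiplication.

```python
from math import prod

SENDER = 2           # value of the sending party
HOP = 5              # value of a connection hub
WEIGHTS = {"T1": 5, "T2": 10, "T4": 15}
T3_FIXED_TSR = 1000  # assumption: T3 is scored without any calculation


def path_tsr(links):
    """Return the TSR for a route given as a list of connection types.

    One entry means a direct connection; two entries mean a route
    through one connection hub. Longer routes are outside the model.
    """
    if "T3" in links:
        return T3_FIXED_TSR
    if not 1 <= len(links) <= 2:
        raise ValueError("the model only covers one or two links")
    score = SENDER * prod(WEIGHTS[link] for link in links)
    if len(links) == 2:
        score *= HOP  # the route passes through one connection hub
    return score


print(path_tsr(["T1"]))        # 2*5 = 10
print(path_tsr(["T1", "T4"]))  # 2*5*5*15 = 750
print(path_tsr(["T4", "T4"]))  # 2*15*5*15 = 2250
```

With these values a route over two T4 connections scores higher than the fixed 1000 of a T3 connection, so the values would probably need tuning against real data.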

The entire model would make sense if people behaved and only played by the rules of the model above. However, in a normal world you will see that multiple routes to a person are possible, and we have to take this into account. You can see an example of this below.

In this example you see two possible routes to hop B3: via connection hub B or via D. Based on the model we cannot tell whether B3 will appreciate your message, because if he were willing to follow you he could have made a direct relation. So to get a correct TSR we have to calculate the average TSR of both routes: {((you * T1 * connection-hub * T1) + (you * T4 * connection-hub * T1)) / 2}, which is {((2*5*5*5) + (2*15*5*5)) / 2}, or a TSR of 500 for this message. We only calculate an average TSR when there is no direct connection; if there are multiple paths and also a direct connection, we ignore the other paths and use only the direct connection to calculate the TSR.
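
The rule for multiple routes can be sketched in Python as follows. The function name `message_tsr` and its parameters are assumptions made for illustration. The sketch also assumes that the middle factor of each route is the connection-hub value of 5, as in the earlier examples, and that a receiver with no direct connection and no route at all is a T3 with a fixed TSR of 1000.

```python
SENDER = 2           # value of the sending party
HOP = 5              # value of a connection hub (assumed for every route)
WEIGHTS = {"T1": 5, "T2": 10, "T4": 15}
T3_FIXED_TSR = 1000  # assumption: no connection at all gets a fixed score


def message_tsr(direct=None, routes=()):
    """Return the TSR for one message.

    direct: connection type of a direct link, or None if there is none.
    routes: (first_link, second_link) pairs, one per route via a hub.
    """
    if direct is not None:
        # A direct connection wins; all other routes are ignored.
        return SENDER * WEIGHTS[direct]
    if not routes:
        return T3_FIXED_TSR
    scores = [SENDER * WEIGHTS[a] * HOP * WEIGHTS[b] for a, b in routes]
    return sum(scores) / len(scores)


# Two routes to the same receiver: (2*5*5*5 + 2*15*5*5) / 2
print(message_tsr(routes=[("T1", "T1"), ("T4", "T1")]))  # 500.0
```
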

Now we have a good model for calculating the value of relations, but scoring a high TSR every now and then does not make you a spammer on Twitter. Every now and then you want to contact people you do not know and maybe build a stronger relation later on. So we have to measure the TSR within a time frame and a tweet frame. Based on the number of tweets, the time and the TSR you can start to determine whether a person is a spammer. In a normal world you will see that a spammer hits a lot of high TSR scores, and a lot of the same scores in a row, while a normal human user hits mostly low scores and the scores differ a lot. This is one way to identify a spammer.
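
As a rough illustration of such a check, here is a minimal sketch in Python that looks at the TSR values of a user's most recent tweets. All thresholds (`avg_threshold`, `high_score`, `max_streak`) are invented placeholders, not values derived from Twitter data, and the function name `looks_like_spammer` is hypothetical. It flags a user when the average TSR is too high or when the same high score is repeated too many times in a row.

```python
from statistics import mean


def looks_like_spammer(scores, avg_threshold=500, high_score=750, max_streak=5):
    """Judge a window of TSR values, oldest first.

    All thresholds are placeholder assumptions that would have to be
    tuned against real data.
    """
    if not scores:
        return False
    longest = run = 0
    previous = None
    for score in scores:
        if score >= high_score and score == previous:
            run += 1   # same high score as the tweet before
        elif score >= high_score:
            run = 1    # a new run of high scores starts
        else:
            run = 0    # a low score breaks the run
        longest = max(longest, run)
        previous = score
    return mean(scores) > avg_threshold or longest >= max_streak


normal_user = [10, 20, 10, 750, 10, 20, 10]
bulk_sender = [1000] * 5 + [10] * 20
print(looks_like_spammer(normal_user))  # False
print(looks_like_spammer(bulk_sender))  # True
```
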

This model and its calculations are rough and not based on actual research on Twitter data. However, if access to Twitter data were granted, someone could complete this model, do some test runs and see what the exact behavior of a spammer is. The model can then be tuned and perfected. I would also like to point out that, for example, the growth of connections can be used in combination with the TSR to determine the intentions of a Twitter user. To be precise, a spammer wants a large network very quickly, so he will most likely add hundreds of connections within a short period of time, which is not the case for most human users. I hope this blog post will come to the attention of some people at Twitter and that they are willing to give it some thought, because I would be very disappointed if Twitter collapsed under its own success and the spammers this success attracts.
