Johan Louwers - Tech blog: math

Showing posts with label math. Show all posts

Wednesday, August 19, 2009

Twitter Spam Trust Model

Some time ago I signed up for a twitter account as you could been reading on my weblog some time ago. I started using twitter just for fun and try to find out what everyone is talking about on twitter. After some time I became quite happy with the service and the information which can be found on twitter and the way you can interact with people you never have spoken to before and who might be unavailable to reach if it was not for twitter.

However, as with every good service after some time it will also be used to promote goods and services you might not want. You will be contacted by people in such a way that you can consider it spam. Twitter spam is currently in my opinion the biggest problem and threat to twitter and its growth. If people are using it they do not want to be annoyed with all kinds of spam messages. Some time ago I posted a tweet stating that twitter spam will be the next big fight. On this tweet I got some reactions via twitter and also offline. Some people stated that if this was the next fight in my opinion I should make a point by thinking about the subject and creating some kind of approach on how Twitter should fight this fight.

As twitter is just a message service from a person to one or more other persons some of the approaches designed for fighting email spam can be applied. Even some in a more effective way as all communication is happening inside the twitter.com domain. For example a trust model can be very easily applied, already used for email it can be used to fight twitter spam.

Trust model:
A trust model against twitter spam should find the relationship you as a sender is having with the person you are sending the tweet to. A Tweet Spam Rank (TSR) could be calculated for the tweet and the higher the TSP the lower the trust between the sender and the receiver. You can send a message to someone you do not have a relation with, this will provide you a high TSR however will not make you a spammer. To prevent the effect that you will be banned as a spammer due to the fact you send a single message to someone you have no relation with you should have a average TSR over time which is below the threshold of being identified as a spammer. However, the TSR calculation will have a big role in the spam fighting. Before explaining the TSR calculation first some basics on the twitter relation model and the components inside this model.

You, or the sending part, will be represented in the model with as the green dot, as you can see you can have several relations (or non relations) with other hops. Hops are other twitter users you send a message to or who are a bridge to other hops. The model in its current version will only go for two hops. So max a connection hop and a destination hop. To be sure if this is “deep” enough one should run some calculations on the twitter data.

As can been seen in the picture above there are four types of connections that can be made:

- T1, a connection with a hop and a connection back. You follow the tweets of this person and this person on his turn is following your tweet. As you both follow the other you most likely will have a strong connection so sending a message over this connection will result in a low TSR.

- T2, a connection from a remote hop to you. This person is following you and you do not follow him. So for some reason this person is interested in you so if you send a direct tweet to this person he or she will most likely be wiling to accept this. It is not as strong as a double connection however still a low TSR.

- T3, a non connection. You have no connection whatsoever to this person, not even via a connection hop so this will result in a high TSR score.

- T4, you follow a person however this person is not following you. So for some reason you have interest in the tweets from this person however this person is not following you. So a direct tweet to this person will result in a higher TSR.

Now we have to connect some values to the parts of the trust model so we can calculate the TSR of a message. For this we refer to the model as it is shown below. As you can see all possible relations within the trust model are represented in this diagram.

We start with the sending party, a sending party will have for calculation reasons the value 2. T1 connections will have a value of 5, T2 has a value of 10, T3 has a value of 100 and T4 a value 15. A connection hub will have a value of 5.

Now lets say you want to send a message to the user in hop B we can calculate the TSR like {you * T1} which will be {2*5} so this message will have a TSR of 10 which is the lowest TSR you can get. Meaning you just sent a message with a very low Twitter Spam Rank. However, sending a message to B1 will have a calculation like {you * T1 * connection-hub * T4} which is {2*5*5*15} meaning you will have a TSR of 750 for this message.

For example you can be sending a message to C1. You have a very weak connection with D2 so you should get a high TSR. {you * T4 * connection-hub * T4}, this results in {2*15*5*15} which results in a TSR of 900. This is the most weak connection you can have with a connection hop and two times a T4 connection. However, one exception on the rule is a T3 connection which will result in TSR of 1000 without any calculation needed to be done.

The entire model would make sense if people would behave and only play by the rules of the model above. However in a normal world you will see that multiple routes to a person are possible and we have to take this into account. You can see a example of this below.

In this example you see two possible routes to hop B3. You can take the route to B3 via connection hub B or via D. Based upon the model we can not state if B3 will appreciate your message because if he is willing to follow you he could have made a direct relation. So to get a correct TSR we have to calculate the average TSR of both connections, meaning you will have to calculate {(you * T1 * 2 * T1) + (you * T4 * 2 * T1) / 2 } This will give you the correct TSR for this message. We only do a average TSR calculation in case there is no direct connection, so even if there are multiple paths and a direct connection we will ingnore the other paths and only use the direct connection to calculate the TSR.

Now we have a good way model of calculation the value of relations within the model, however scoring a high TSR every now and then is not making you a spammer on Twitter. Every now and then you like to contact people you do not know and maybe build a stronger relation later in time. So we have to measure the TSR score within a time and tweet frame. Based upon the number of tweets, the time and the TSR you can start to determine if a person if a spammer. In a normal world you will see that a spammer will hit a lot of high TSR scores and a lot of the same scores on arrow while a normal human user will hit mostly low scores and the TSR scores differ a lot. This is a way how you can identify a spammer.

This model and the calculations are raw and not based on actual research on the twitter data, however, if access to Twitter data could be granted someone could complete this model and do some test drives on this and see what the exact behavior of a spammer is. The model can be tuned and perfected. Also I would like to point out that for example the growth of connections can be used in combination with TSR to determine the intentions of a Twitter user. To be precise, a spammer would like to have a large network very quickly so he most likely will add hundreds of connections within a short periode of time while this is not the case for most human users. So this also can be used in combination with TSR to identify spammers. I hope this blogpost will come to the attention of some people at twitter and that they are willing to give this a thought because I would be very disappointed if Twitter collapses under its own success and the spammers it attracts with this success.

Saturday, December 20, 2008

Partition Decoupling Method

When working on complex systems and trying to map them you will find that the relations within your data can become very complex very fast. Complex data which is time-dependent and interrelated will form a complex maze of data and dependencies which will become almost a impossible task of mapping.

A group of researchers from Darmouth have developed a mathematical tool which can help understand complex data systems like the votes of legislators over their careers, second-by-second activity of the stock market, or levels of oxygenated blood flow in the brain.

“With respect to the equities market we created a map that illustrated a generalized notion of sector and industry, as well as the interactions between them, reflecting the different levels of capital flow, among and between companies, industries, sectors, and so forth,” says Rockmore, the John G. Kemeny Parents Professor of Mathematics and a professor of computer science. “In fact, it is this idea of flow, be it capital, oxygenated blood, or political orientation, that we are capturing.”

Capturing patterns in this so-called ‘flow’ is important to understand the subtle interdependencies among the different components of a complex system. The researchers use the mathematics of a subject called spectral analysis, which is often used to model heat flow on different kinds of geometric surfaces, to analyze the network of correlations. This is combined with statistical learning tools to produce the Partition Decoupling Method (PDM). The PDM discovers regions where the flow circulates more than would be expected at random, collapsing these regions and then creating new networks of sectors as well as residual networks. The result effectively zooms in to obtain detailed analysis of the interrelations as well as zooms out to view the coarse-scale flow at a distance."
Source Press Release

In a paper named "Topological structures in the equities market network" written by Gregory Leibon, Scott D. Pauls, Danile Rockmore and Robert Savell the Partition Decoupling Method is used to map the underlying structure of the equities market network.

"We present a new method for the decomposition of complex systems given a correlation network structure which yields scale-dependent geometric information — which in turn provides a multiscale decomposition of the underlying data elements. The PDM generalizes traditional multi-scalar clustering methods by exposing multiple partitions of clustered entities. "

More information can be found at:
http://www.sciencedaily.com/releases/2008/12/081216131022.htm
http://www.dartmouth.edu/~news/releases/2008/12/16.html
http://arxiv.org/pdf/0805.3470

Sunday, March 16, 2008

Adaptive Algorithms for Online Optimization

Google released a new google techtalk video and now on the subject of Adaptive Algorithms for Online Optimization presented by Seshadhri Comandur a Research Scientist. A list of his publications can be found at his website.

ABSTRACT

The online learning framework captures a wide variety of learning problems. The setting is as follows - in each round, we have to choose a point from some fixed convex domain. Then, we are presented a convex loss function, according to which we incur a loss. The loss over T rounds is simply the sum of all the losses. The aim of most online learning algorithm is to minimize *regret* : the difference of the algorithm's loss and the loss of the best fixed decision in hindsight. Unfortunately, in situations where the loss function may vary a lot, the regret is not a good measure of performance. We define *adaptive regret*, a notion that is a much better measure of how well our algorithm is adapting to the changing loss functions. We provide a procedure that converts any standard low-regret algorithm to one that provides low adaptive regret. We use an interesting mix of techniques, and use streaming ideas to make our algorithm efficient. This technique can be applied in many scenarios, such as portfolio management, online shortest paths, and the tree update problem, to name a few.

Sunday, February 17, 2008

The Monty Hall problem

A pub is (A) the best place to have a discussion (B) the worst place to have a discussion. A couple of days ago I had a lengthy discussion with a friend in a pub about the Monty Hall Problem. In short the Monty Hall problem is the following:

Imagine yourself in a gameshow, you can pick out of 3 doors, after 2 doors will be a goat after 1 door you will have a brand new car. Lets say you pick door 1, now the showhost is opening an other door and asks you if you want to stay with your original pick or you still want to switch. In my opinion switching will increase your changes of winning. The pictures below will ilustrate why.

In the beginning situation you will have a 1/3 change of picking the correct door. When you pick this door the change that the car is behind one of the other doors is 2/3 (2*1/3). Now the showhost is opening one of those doors so you know that you that the car is not behind this door and the 2/3 change is now on one single door. Making it a better pick than your original door, which still has a 1/3 change.

This little puzzle is the source for some discussion, I had the discussion and a lot of other people did also have very very heated debates about this. Try to make up your own mind and google some on the topic. A nice brain trainer.....

Wednesday, October 03, 2007

Women Scared Away From Math?

Are Women Being Scared Away From Math, Science, And Engineering Fields? Have you ever felt outnumbered? Like there are just not that many people like you around? We’ve all felt outnumbered in one situation or another and walking into a situation in which you sense the possibility of being ostracized or isolated can be quite threatening.

One group that may experience this kind of threat is women who participate in math, science, and engineering (MSE) settings- settings in which the gender ratio is approximately 3 men to every 1 woman. Recently, in the wake of comments made by former Harvard University President, Larry Summers, suggesting that women may not possess the same “innate ability” or “natural ability” in these fields as do men, several leading scientific institutions and university presidents publicly lamented the underrepresentation of women in Math, Science and Engineering fields and put out a call to study the reasons for the numbers gap in these areas.

While previous research offers biological and socialization explanations for differences in the performance and representation of men and women in these fields, Stanford psychologists, Mary Murphy and Claude Steele argue that the organization of Math, Science and Engineering environments themselves plays a significant role in contributing to this gap. Murphy contends that situational cues (i.e. being outnumbered) may contribute to a decrease in women’s performance expectations, as well as their actual performance.

Murphy and colleagues showed a group of advanced MSE undergraduates a gender balanced or unbalanced video depicting a potential MSE summer leadership conference. To assess identity threat, the researchers measured the participant’s physiological arousal during the video, cognitive vigilance, sense of belonging and desire to participate in the conference.

The results are telling. The women who watched the gender unbalanced video- where women were outnumbered by men in a 3 to 1 ratio- experienced faster heart rates, higher skin conductance (sweating), and reported a lower sense of belonging and less desire to participate in the conference.

They also found that women were more vigilant to their physical environment when they watched the video in which women were outnumbered. Throughout the testing room, Murphy planted cues related to Math, Science, and Engineering such as magazines like Science, Scientific American, and Nature on the coffee table and a portrait of Einstein and the periodic table on the walls. Women were able to recall more details about the video and the test room, indicating that they paid more attention to the identity-relevant items in order to assess the likelihood of encountering identity threat. “It would not be surprising if the general cognitive functioning of women in the threatening setting was inhibited because of this allocation of attention toward MSE-related cues,” write the authors. Thus, it is likely that this kind of attention allocation would interfere with performance and might help explain the performance gap between men and women in these fields.

While men, in either condition, showed no significant difference in physiological arousal, cognitive vigilance, or sense of belonging, both men and women expressed more desire to attend the conference when the ratio of men to women was balanced. Murphy says that while it’s interesting that both men and women want to be where the women are, the motivations of men and women for wanting to be there are probably quite different. “Women probably feel more identity-safe in the environment where there are more women- they feel that they really could belong there- while men might simply be attracted by the unusual number of women in these settings. Men just aren’t used to seeing that many women in these settings, because the numbers in real Math, Science, and Engineering settings are so unbalanced.”

These findings, which appear in the October issue of Psychological Science, a journal of the Association for Psychological Science, demonstrates that rather than being endemic to women the experience of identity threat in MSE settings is attributable the situation.

This research underlies the importance of situational cues and Murphy hopes that it will "inspire greater motivation to attend to such cues when creating and modifying environments so that they may foster perceptions of identity safety rather than threat."

Note: This story has been adapted from material provided by Association for Psychological Science.