Optimising Spam Filters


Optimising Your Spam Filter

In the following discussion I will generally follow the development of my ideas in the order of their occurrence: first in, first out. There may be better ways to present the matter, but I think this makes my ideas easier to follow.

In contrast to this, I will introduce later ideas based on comments about this text. Because I got a comment that I'm totally wrong in the first part, I will add a small example not related to spam.

The Problem of Predicting

This chapter was inserted later to give an example which doesn't use the word spam.

Let us assume you found a gift shop with surprise gifts. Because you are tremendously rich, you bought 1000 gifts for your girlfriend. You made a statistic and found out that 100 gifts were packed in red paper. Your girlfriend disliked 90 of them and was only happy about 10.

But your parcels aren't only single-colored. They are multi-colored. To save space and processing time you didn't count the result of two or more colors. If a parcel is blue, red and green, you count the result "girlfriend happy / unhappy" for each color separately. And you found that out of 400 green parcels 390 made your girlfriend happy and 10 not.

Next time the nice shop-girl offers you a red parcel. What's the probability that your girlfriend will be happy about the present?

Based on your statistical data 10% [ = 10/(90+10) ] will be a good assumption for this probability.

You should ask the nice girl behind the counter for another present packed in a different color.
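To make the bookkeeping concrete, here is a minimal Python sketch of the per-color statistic. The counts are the ones from the story; the data layout and function name are my own:

 # Per-color statistic from the gift-shop story: for each color we only
 # keep two counters, "happy" and "unhappy", exactly as described above.
 color_stats = {
     "red":   {"happy": 10,  "unhappy": 90},
     "green": {"happy": 390, "unhappy": 10},
 }
 def p_happy(color):
     """Estimated probability that a parcel of this color makes her happy."""
     c = color_stats[color]
     return c["happy"] / (c["happy"] + c["unhappy"])
 print(p_happy("red"))    # 0.1   -> better ask for a different color
 print(p_happy("green"))  # 0.975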

This example resembles our spam problem. The spam filter can only see the paper. We are looking for good indicators in parts of the message which tell us whether it contains spam.

After you rejected the red parcel, the nice shop-girl offered you a red-green parcel. What is the probability that your girlfriend will be happy about the parcel?

Unfortunately you have no statistical data about pairs of colors. You can't build a new statistic because your girlfriend removed all the wrapping.

I will come to this problem later on. To prepare the way, we will have a look at the task of gathering the statistical data.

The Best Ham to Spam Ratio for Bayesian Filters

When using Bayesian Filters for spam detection two questions are frequently raised. What is the best amount of ham and spam that should be trained? What is the best ratio between ham and spam?

One myth you read again and again is the 1:1 ratio.

I read an article claiming the best ratio is 1:1; this was supported by a test and later derived from Bayes' theorem. Unfortunately I didn't copy this article and I can't remember enough to find it by googling.

The problem is: The conclusion of the article is wrong.

What I will try to show in the next steps - which unfortunately require a little bit of math - is:

Train your Bayes filter in accordance with your real spam to ham ratio and train as much as possible. But never train too little ham or train only spam!

Off we go into Bayes' theorem.

In short, Bayes' theorem says

p(spam|token) = p(token|spam)*p(spam)/p(token)

This means: the probability of a message being spam under the condition that a token is in the message
is equal to
the probability that the token is contained in a spam message
multiplied by
the probability of a message being spam
divided by
the probability that a message contains the token.

If you have received s spam messages and h ham messages, where the token is in S spam messages and H ham messages, then you will get:

s = number of spam messages
h = number of ham messages
S = number of spam messages containing the token
H = number of ham messages containing the token
s+h = total number of messages
S+H = total number of messages containing the token

Thus it appears that

p(spam) = s/(s+h)

is an approximation of the probability of a random message being spam. And:

p(token) = (S+H)/(s+h)
p(token|spam) = S/s

From this follows that

p(spam|token) = S/s * s/(s+h) / ((S+H)/(s+h)) = S/(S+H)

This means that the probability of a given message containing the token being spam is independent of the total number of messages trained.
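As a sanity check, the whole result fits into a few lines of Python (a minimal sketch; the function name is my own):

 def p_spam_given_token(S, H):
     """p(spam|token) = S/(S+H), as derived above from Bayes' theorem.
     S = number of trained spam messages containing the token,
     H = number of trained ham messages containing the token.
     The totals s and h cancel out, so they do not appear here."""
     return S / (S + H)
 # The result depends only on the token counts, not on how much was trained:
 print(p_spam_given_token(100, 50))  # 0.666... for any corpus size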

Before we look at some examples let us compare this result with the spam probability Paul Graham proposed in "A Plan for Spam".


" Paul Graham multiplied the ham count by 2 to avoid False Positives. p(S)=S/(S+2+H). But he restricted the minimum and maximum probability to 0.01 and 0.99. If a token was never found in a ham message he forced a ham count of 1. He required that s+2h > 5 to avoid high spam probabilities on low data. For tokens never seen before he assumed a spam probability of .4 based on expirience. This means he left the true statistical way for an heuristic approach. If you take only tokens into accout for which you have enough data you can avoid the multipling by 2 and use the statistical value of S/(S+H). Therefor I will not follow Paul Grahams way. And I will allow a ham count of 0.

Our equation p(S)=S/(S+H) shows that your predicted spam probability will rise if you count only part of your ham messages, and will fall if you count only part of the spam messages. (Ignoring the boundary conditions, this is also true for Paul Graham's spam probability.)

Now to the example based on one token.

How much ham and spam should we train to get good statistical data for one token? And what happens if we train only part of it?

Let's assume your real spam to ham ratio is 10 to 1 and your mailbox contains 1100 messages: 1000 spam and 100 ham. Suppose 100 of the spam messages and 50 of the ham messages contain a certain token, say "honey". We will train all messages.

           Overall   Trained   "honey"   "honey" trained
  Ham          100       100        50                50
  Spam        1000      1000       100               100
  Sum         1100      1100       150               150

You will get a probability of 100 / (100+50) = 66.6% that the next message containing the token is spam. That isn't a high probability, but it works fine for this example.

Now we will train only 10% of your spam to get a spam to ham ratio of 1:1. You will presumably count only 10 spam messages with the token. The remaining 90 spam messages containing the token are not trained.

  10% spam trained
           Overall   Trained   "honey"   "honey" trained
  Ham          100       100        50                50
  Spam        1000       100       100                10
  Sum         1100       200       150                60

This leads to a spam probability of only 10 / (10+50) = 16.6%, which is lower than before.

Conclusion:

When you train less spam, you will get a lower spam probability for a message. In other words: your False Negative rate will rise.

What happens when you train less ham?

Let's assume you train only 50% of your ham but all your spam. You will presumably count only 25 ham messages with the token.

  50% ham trained
           Overall   Trained   "honey"   "honey" trained
  Ham          100        50        50                25
  Spam        1000      1000       100               100
  Sum         1100      1050       150               125

This leads to a spam probability of 100 / (100+25) = 80%, the highest so far.

Conclusion:

When you train less ham you will get a higher spam probability for a message. In other words: Your False Positive rate will rise.


BTW: The overall spam probability without a test is 1000/1100 = ~91%.
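All the partial-training scenarios in this section can be played through with one small sketch. The assumption, taken from the tables, is that training a fraction of a class also catches the same fraction of the messages containing the token; the function name is my own:

 def token_spam_probability(spam_with_token, ham_with_token,
                            spam_fraction=1.0, ham_fraction=1.0):
     """p(S) = S/(S+H) after training only a fraction of each class.
     Assumption taken from the tables: training x% of a class also
     catches x% of the messages that contain the token."""
     S = spam_with_token * spam_fraction
     H = ham_with_token * ham_fraction
     return S / (S + H)
 print(token_spam_probability(100, 50))                     # 0.666  all trained
 print(token_spam_probability(100, 50, spam_fraction=0.1))  # 0.166  10% spam
 print(token_spam_probability(100, 50, ham_fraction=0.5))   # 0.8    50% ham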

What happens when your ham to spam ratio is 10 to 1?

  100% trained
           Overall   Trained   "honey"   "honey" trained
  Ham         1000      1000       100               100
  Spam         100       100        50                50
  Sum         1100      1100       150               150

=> 50 / (50+100) = 33.3%

Now let's train only 50% of our spam.

  50% spam trained
           Overall   Trained   "honey"   "honey" trained
  Ham         1000      1000       100               100
  Spam         100        50        50                25
  Sum         1100      1050       150               125

=> 25 / (25+100) = 20.0%

Now let's train only 10% of our ham.

  10% ham trained
           Overall   Trained   "honey"   "honey" trained
  Ham         1000       100       100                10
  Spam         100       100        50                50
  Sum         1100       200       150                60

=> 50 / (50+10) = 83.3%

A last example

What happens when your ham to spam ratio is extremely skewed? Let's assume 1:100.

  100% trained
           Overall   Trained   "honey"   "honey" trained
  Ham          100       100        50                50
  Spam       10000     10000       100               100
  Sum        10100     10100       150               150

=> 100/(100+50) = 66.6%

Every second ham message contains the token, but only one of 100 spam messages. That means if you get a message with the token, it is spam in 2 of 3 cases! Which is exactly what the Bayesian filter says: I got a message which contains a special token, and in the past such messages were spam in 2 of 3 cases.

What happens when you train only 100 spam messages to get a ratio of 1:1?

  1% spam trained
           Overall   Trained   "honey"   "honey" trained
  Ham          100       100        50                50
  Spam       10000       100       100                 1
  Sum        10100       200       150                51

=> 1 / (50+1) = 1.9%

The Bayesian filter will now say that the message is spam with a probability of 1.9% and ham with 98.1%. The filter is useless: it classifies everything as ham. With the ratio reversed, it would classify everything as spam.
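With the token_spam_probability sketch from above, this extreme case is a single call:

 print(token_spam_probability(100, 50, spam_fraction=0.01))  # 0.0196 = 1.9%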

Conclusion

If you train too little spam you will get a higher False Negative rate; if you train too little ham you will get a higher False Positive rate.

Because a False Positive is more harmful than a False Negative, my final conclusion is:

Train your Bayesian Filter in accordance with your real spam to ham ratio. And never train too little ham or train only spam!

BTW: To avoid False Positives, Paul Graham in "A Plan for Spam" multiplied his token counts for ham by 2 (as mentioned above).

Another lesson should be: Never train whitelisted mails as ham!!!

This is true for one token. It should be true for any token, and I therefore assume it's true for all tokens.


Spam probability when two or more rules or tokens hit

!!! The following part is under development. It may contain errors. At the moment I assume that it contains at least one error. Comments are very welcome; they may save me time finding the errors. !!! I don't have typos in mind.

After we have seen what happens to the spam probability of a message hit by a single token, the question arises: "What is the spam probability of a message where two or more tokens hit?" You can find a simple answer from Paul Graham (http://www.paulgraham.com/naivebayes.html) and a deeper discussion at www.mathpages.com (http://www.mathpages.com/home/kmath267.htm). The problem discussed on the mathpages is similar to Bayesian Filters. Only similar. Before we continue, let us have a quick look at the differences between rule-based spam detection and Bayesian Filters.

<thead align="center" valign="middle" bgcolor="gray"> </thead>
Bayesian versus rule-based filtering

Bayesian: All mails are automatically divided into tokens, which are counted and stored in a database.
Rules: Rules are written by humans, based mostly on spam, and combined in a ruleset.

Bayesian: A token is a pattern which occurred in previous messages.
Rules: A rule is a regular expression which matches a generalisation of patterns found in previous messages and may take care of future obfuscations (viagra, v1agr@, ...). Rules can also check for a special form of the message (HTML, attachments, ...) or make an "external" query to a blacklist or whitelist.

Bayesian: A spam probability is assigned to every token, based on the counts.
Rules: A score is assigned to a rule or a combination of rules, based on mass-checks or gut feeling.

Bayesian: Tokens hit spam and ham in accordance with their probability.
Rules: Rules are designed to hit nearly no ham.

When you do a mass-check you get a statistic of how many spam and ham messages were hit by your rules. There is no difference to the Bayesian Filter database. And from this statistic you can deduce that a message is spam with x% probability when the rule hits. For spam detection, only rules with a very high spam probability are selected. You can transform your Bayesian filter into a ruleset by generating a rule for every token.

To keep it short: Bayesian Filters are "automated ruleset generators" that calculate the scores from the probability of hits.
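That last sentence can be taken literally. A minimal sketch (toy counts and all names are mine) that generates one scored "rule" per token:

 import re
 # Toy token database (made-up counts): token -> (spam count S, ham count H)
 token_db = {"honey": (100, 50), "rolex": (80, 1)}
 def make_ruleset(db):
     """One trivial case-insensitive regex per token, scored with the
     statistical probability S/(S+H) from above."""
     return [(re.compile(re.escape(token), re.IGNORECASE), S / (S + H))
             for token, (S, H) in db.items()]
 for pattern, score in make_ruleset(token_db):
     print(pattern.pattern, round(score, 3))   # honey 0.667, rolex 0.988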

Therefore I will not differentiate between tokens and rules. Instead of token T, from now on I will speak of test T.

Before we go into the details, I will make some remarks about the notation used. For easy reading I have to modify the notation used above a little bit. Capital letters A, B, C, ... will indicate an event. S will indicate the event of a message being spam, H of being ham. Ti is the event that token Ti is in a message.

Lower case letters a, b, c ... will indicate the number of times an event A, B, C ... has been counted in our statistical data. In case of conditional probability I will use [] to group the expression. [a|B] is the number of events A where B has been realised. For easy reading we define si as the spam count and hi as the ham count of token Ti.

p(A) is the probability of event A. p(B|A) is the probability of event B when event A has been realised.

Our statistical data gives us the following probabilities:

  Probability   Meaning                                              Approximation
  p(S)          probability of a message being spam                  s/(s+h)
  p(H)          probability of a message being ham                   h/(s+h)
  p(Ti)         probability of a message containing token Ti         ti/(s+h)  (with ti = si+hi)
  p(H|Ti)       probability of a message being ham when token Ti     [h|Ti]/([s|Ti]+[h|Ti])
                is in the message                                    or hi/(si+hi)
  p(S|Ti)       probability of a message being spam when token Ti    [s|Ti]/([s|Ti]+[h|Ti])
                is in the message                                    or si/(si+hi)
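In code, this notation boils down to two counters for the corpus plus two per token. A minimal sketch (all names are mine) of the approximations in the table:

 class TokenStats:
     """The counters behind the notation: s, h for the corpus, si, hi per token."""
     def __init__(self, s, h):
         self.s, self.h = s, h        # trained spam / ham message counts
         self.tokens = {}             # token -> [si, hi]
     def p_spam(self):                # p(S) ~ s/(s+h)
         return self.s / (self.s + self.h)
     def p_token(self, t):            # p(Ti) ~ ti/(s+h) with ti = si + hi
         si, hi = self.tokens[t]
         return (si + hi) / (self.s + self.h)
     def p_spam_given(self, t):       # p(S|Ti) ~ si/(si+hi)
         si, hi = self.tokens[t]
         return si / (si + hi)
 stats = TokenStats(s=1000, h=100)
 stats.tokens["honey"] = [100, 50]
 print(stats.p_spam_given("honey"))   # 0.666...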

Now let us formalize our question. What is the probability of a message being spam when tests T1 and T2 both hit? In our notation we seek p(S|T1;T2), or more generally p(S|T1;...;Tn) with n > 1, n in N.

What can we do?

1. p(S;T1;T2) = p(T1)*p(T2|T1)*p(S|T1;T2)

p(S|T1;T2) is what we are looking for. Unfortunately we can only get p(T1) from our data. We ignored the fact that two or more tokens occurred in a message, so p(T2|T1) is unknown. Our equation is underspecified.

There may be another possibility:

2. p(S;T1;T2)=p(T1)*p(S|T1)*p(T2|T1;S)

The probability p(S|T1) is in our data, but p(T2|T1;S) is not. So we have 2 equations with 4 unknowns. We need more equations to solve the problem.

3. p(S;T1;T2) = p(T2)*p(S|T2)*p(T1|T2;S)

The probability p(S|T2) is in our data, but p(T1|T2;S) is not. Now we have 3 equations with 5 unknowns. The system is still underspecified.

To make it short: we lost data and there is no way to get it back. In this case we can make assumptions (we need two) to get rid of the unknowns.

First assumption: T1 and T2 are independent:

p(T1|T2)=p(T1) and 
p(T2|T1)=p(T2)

It follows:

p(T1;T2) = p(T1|T2)*p(T2) = p(T1)*p(T2)

Is this a good assumption? Maybe with two public-opinion polls. But with spam? Let T1 be the occurrence of "rolex" and T2 the occurrence of "watch" in a mail. I assume that p(T2|T1) is nearly 1.

Before we proceed, let's have a look at what happens if p(T2|T1) = 1. This means: if T1 occurs, then T2 will also occur. With p(¬T2|T1) = 1 - p(T2|T1) = 1 - 1 = 0, the probability p(S;¬T2;T1) = p(S|¬T2;T1)*p(¬T2|T1)*p(T1) = 0. There is no spam where T1 occurs but T2 does not.

If p(T2|T1) = 1 then p(T2|T1;S) = 1.

With equation 2. we get
p(S;T1;T2) = p(T1)*p(S|T1)*p(T2|T1;S) = p(T1)*p(S|T1)*p(T2|T1) = p(T1)*p(S|T1)

With equation 1. we get

p(T1)*p(S|T1) = p(T1)*p(T2|T1)*p(S|T1;T2)

p(S|T1) = p(S|T1;T2)


This means that T2 does not influence the spam probability if T1 occurs. It does not mean that test T2 is useless: we did not assume that p(T1|T2) = 1. But if additionally p(T1|T2) = 1, then p(S|T1;T2) = p(S|T2) = p(S|T1).

Conclusion:

If two tests are fully correlated, we get no better prediction by combining them. One of them is redundant.
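A small numerical check of this conclusion, with made-up counts chosen so that every message containing T1 also contains T2, i.e. p(T2|T1) = 1 as in the derivation above:

 # Made-up counts chosen so that T1 never occurs without T2, i.e. p(T2|T1) = 1.
 spam_both, ham_both = 40, 10      # messages containing both T1 and T2
 spam_T1_only = ham_T1_only = 0    # T1 alone never occurs
 p_S_given_T1 = (spam_both + spam_T1_only) / (
     spam_both + spam_T1_only + ham_both + ham_T1_only)
 p_S_given_T1_T2 = spam_both / (spam_both + ham_both)
 print(p_S_given_T1, p_S_given_T1_T2)   # both 0.8: T2 adds nothing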

We will leave this path and go back to our previous assumption.

First assumption: T1 and T2 are independent:

p(T1|T2)=p(T1) and
p(T2|T1)=p(T2)

This helps a bit because we can eliminate one unknown from our equations.

Second assumption: T1 and T2 are independent also under the condition S:

p(T1|T2;S) = p(T1|S) and 
p(T2|T1;S) = p(T2|S)

Now we get:

1a. p(S;T1;T2) = p(T1)*p(T2)*p(S|T1;T2)
2a. p(S;T1;T2) = p(T1)*p(S|T1)*p(T2|S)
3a. p(S;T1;T2) = p(T2)*p(S|T2)*p(T1|S)
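One algebraic step that 1a and 2a already permit: dividing 2a by 1a gives p(S|T1;T2) = p(S|T1)*p(T2|S)/p(T2), and by Bayes' theorem p(T2|S)/p(T2) = p(S|T2)/p(S), so p(S|T1;T2) = p(S|T1)*p(S|T2)/p(S). A sketch with illustrative numbers (not from the text):

 # Dividing 2a by 1a: p(S|T1;T2) = p(S|T1) * p(T2|S) / p(T2), and by
 # Bayes' theorem p(T2|S)/p(T2) = p(S|T2)/p(S).  Illustrative numbers only:
 p_S    = 0.9    # p(S)
 p_S_T1 = 0.95   # p(S|T1)
 p_S_T2 = 0.92   # p(S|T2)
 p_S_T1_T2 = p_S_T1 * p_S_T2 / p_S
 print(p_S_T1_T2)   # ~0.971. Nothing forces this expression to stay below 1;
 # the two independence assumptions are strong.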

to be continued.