sql - Build a more intuitive chat filter in PHP
Profanity API
I have built a basic profanity API that echoes 1 if it finds profanity in a message and 0 if the message is okay. I've run into some silly problems, though.
For example, if the word hell is on my swear list it'll also identify words like hello as profanity.
Each word is in a txt file in this format
badword
badword
badword
lolanotherbadword
naughtyword
LeetSpeak
1 4l50 w4n7 70 1mpl3m3n7 50m3 50r7 0f l337 func710n, 50 7h47 1 d0n'7 h4v3 70 l157 3v3ry p0551bl3 v4r14710n 0f 7h3 w0rd. (I also want to implement some sort of leet function, so that I don't have to list every possible variation of the word.)
Bypassing the Chat Filter
Whether you access the API from
api.domain.tld/chat/profanity.php?access_token=whatever&filter_string=whatever
or
api.domain.tld/chat/profanity/access_token/filter_string
the same problem occurs. If people put an & or ? before their message it allows them to bypass the filter (and echoes a 0). When checking the logs I've noticed that messages that begin with an & or ? are logged as blank messages, so I'm guessing it's just messing up a variable or something.
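A likely explanation, offered as an assumption because the client code that builds the URL is not shown: & and ? are delimiters in a URL, so a message that starts with an unencoded & is read as the start of a new parameter and filter_string ends up empty. The sketch below reproduces this with PHP's own parse_str and shows rawurlencode on the calling side as one way to avoid it. The token value abc is made up.

```php
<?php
// What PHP sees when the message "&hell" is appended to the URL without encoding.
parse_str('access_token=abc&filter_string=&hell', $raw);
var_dump($raw['filter_string']); // string(0) "" : the filter receives an empty message

// Percent-encoding the message on the calling side keeps it inside filter_string.
$query = 'access_token=abc&filter_string=' . rawurlencode('&hell');
parse_str($query, $encoded);
var_dump($encoded['filter_string']); // string(5) "&hell"
```

In the unencoded case the word ends up as a separate, empty parameter named hell, which would match the blank entries seen in the logs.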
Spacing
People think they are clever by typing h e l l with spaces between the letters, and so on. An intuitive chat filter would likely be able to identify this sort of thing.
Data Storage and Retrieval
I've also been wondering whether a txt file is really a suitable storage and retrieval mechanism. Right now I've only got 400 words, but the list will keep growing and is bound to get slow. Which is better: an inline PHP array, a txt file, or a database?
The Code
<?php
require('conn.php');
$date = gmdate('Y-m-d');
$time = gmdate('h:i:s');
$access_token = $_GET["access_token"];
$filter_string = $_GET["filter_string"];
function wordsExist(&$string, $words)
{
foreach ($words as &$word) {
if (stripos($string, $word) !== false) {
return true;
}
}
return false;
}
if (isset($access_token)) {
$sql = "SELECT * FROM api WHERE access_token='" . $access_token . "'";
$sql2 = "UPDATE api SET calls = calls + 1 WHERE access_token='" . $access_token . "'";
$sql3 = "UPDATE api SET last_query = CURRENT_TIMESTAMP WHERE access_token='" . $access_token . "'";
$sql4 = "UPDATE api SET profanity_api_calls = profanity_api_calls + 1 WHERE access_token='" . $access_token . "'";
$sql5 = "UPDATE api SET last_profanity_query = CURRENT_TIMESTAMP WHERE access_token='" . $access_token . "'";
$sql6 = "UPDATE api SET profanity_detected = profanity_detected + 1 WHERE access_token='" . $access_token . "'";
$result = mysqli_query($conn, $sql);
$result2 = mysqli_query($conn, $sql2);
$result3 = mysqli_query($conn, $sql3);
$result4 = mysqli_query($conn, $sql4);
$result5 = mysqli_query($conn, $sql5);
if (mysqli_num_rows($result) >= 1) {
if (wordsExist($filter_string, file('curse-list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))) {
$result6 = mysqli_query($conn, $sql6);
file_put_contents('logs/profanity/' . $date . '-log.txt', "1 [$time] $filter_string\n", FILE_APPEND);
echo '1';
} else {
file_put_contents('logs/profanity/' . $date . '-log.txt', "0 [$time] $filter_string\n", FILE_APPEND);
echo '0';
}
}
}
mysqli_kill();
mysqli_close();
?>
My .htaccess
RewriteEngine On
RewriteRule ^profanity/(.*)/(.*)$ profanity.php?access_token=$1&filter_string=$2
RewriteRule ^advertising/(.*)/(.*)$ advertising.php?access_token=$1&filter_string=$2
Escaping User input
As it stands, how secure is the code above? If it's vulnerable, could I have specific examples of how attackers could abuse it?
Answer
Solution:
Here are a few quick changes you could make to the code which would solve some but not all issues.
1) Your code is vulnerable to SQL injection: an attacker can craft URLs whose access_token value becomes part of your SQL queries and performs all kinds of unintended operations on your database. Fix this as soon as possible with:
$access_token = mysqli_real_escape_string($conn, $access_token);
2) Split your filter string into individual words; this will solve the hello issue. A client could use characters other than spaces between words, and preg_split lets you specify a set of characters to split on.
$filter_words = preg_split("/[\s,\-_]+/", $string);
3) Test out fuzzy matching by comparing the soundex of words rather than their exact text. In PHP, soundex returns a four-character key representing how the input string is pronounced. Expect any fuzzy matching to produce some false positives.
if(soundex($filter_word) == soundex($word)) ...
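To make point 1 concrete, here is a minimal sketch of what the concatenated query looks like when the token is crafted by an attacker. The helper names buildUnsafeQuery and tokenExists are hypothetical; the table api and column access_token come from the question. The second function shows a mysqli prepared statement, which is an alternative to mysqli_real_escape_string rather than the fix given above, and it has not been run against a real database here.

```php
<?php
// Hypothetical helper: builds the query the same way the question's code does.
function buildUnsafeQuery(string $accessToken): string
{
    return "SELECT * FROM api WHERE access_token='" . $accessToken . "'";
}

// A crafted token closes the quote and appends a condition that is always true,
// so the SELECT matches every row and the access check passes.
$malicious = "x' OR '1'='1";
echo buildUnsafeQuery($malicious), "\n";
// prints: SELECT * FROM api WHERE access_token='x' OR '1'='1'

// Hypothetical alternative: a prepared statement sends the token separately
// from the SQL text, so it is never parsed as SQL.
function tokenExists(mysqli $conn, string $accessToken): bool
{
    $stmt = $conn->prepare('SELECT 1 FROM api WHERE access_token = ?');
    if ($stmt === false) {
        return false; // preparing the statement failed
    }
    $stmt->bind_param('s', $accessToken);
    $stmt->execute();
    $stmt->store_result();
    $exists = $stmt->num_rows > 0;
    $stmt->close();
    return $exists;
}
```

The same crafted token is also concatenated into the UPDATE statements in the question, where the always-true condition would make them apply to every row of the api table.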
Here is an additional example that splits the message on whitespace, commas, hyphens and underscores and compares each word with a list of words:
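function wordsExist($filter_string, $words)
{
    // Break the message into words on whitespace, commas, hyphens and underscores.
    $filter_words = preg_split("/[\s,\-_]+/", $filter_string);
    foreach ($words as $word) {
        foreach ($filter_words as $filter_word) {
            if (
                // Exact match.
                ($filter_word == $word) ||
                // At most one inserted, deleted or replaced character.
                (levenshtein($filter_word, $word) < 2) ||
                // Same soundex key, i.e. the words sound alike.
                (soundex($filter_word) == soundex($word))
            ) {
                return true;
            }
        }
    }
    return false;
}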
I've added in soundex and levenshtein as different ways of comparing words. In the few quick tests I did, I got some false positives so it is up to you to decide whether to keep those lines or not.
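As an illustration of those false positives, the sketch below shows that both fuzzy checks treat hello as a match for hell, which is the very case the question complains about. The function hasExactWord is a hypothetical stricter variant that keeps only the exact, whole-word comparison (made case-insensitive here, which is an added assumption).

```php
<?php
// Both fuzzy checks consider "hello" a match for "hell".
var_dump(soundex('hell'));              // string(4) "H400"
var_dump(soundex('hello'));             // string(4) "H400"
var_dump(levenshtein('hello', 'hell')); // int(1), which is < 2

// Hypothetical stricter variant: exact, case-insensitive, whole-word matching only.
function hasExactWord(string $message, array $badWords): bool
{
    $words = preg_split('/[\s,\-_]+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Flip the list so each check is a key lookup instead of a scan of the list.
    $lookup = array_flip(array_map('strtolower', $badWords));
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            return true;
        }
    }
    return false;
}

var_dump(hasExactWord('hello there', ['hell']));   // bool(false)
var_dump(hasExactWord('What the HELL', ['hell'])); // bool(true)
```

Note that this character class does not split on punctuation, so a token such as hell! would not match unless the pattern is extended.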
I also noticed you used the '&' operator on your variables. In PHP this makes a variable a reference (an alias) to the same value, which is not the same as '&' in C, where it takes an address so that a value can be passed by pointer. There is usually no performance benefit to aliasing, since PHP postpones the copy process on variables until one of them is later written to. There is a good SO question on it: In PHP (
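The leet speak and spaced-out letters raised in the question are not covered by the changes above. One possible approach is to normalize the message before it is compared with the word list. The sketch below is only an assumption about how that could look: normalizeMessage is a hypothetical helper, and the digit-to-letter map is inferred from the question's own leet example rather than from any standard.

```php
<?php
// Hypothetical helper: undo common leet substitutions and re-join spaced-out letters.
function normalizeMessage(string $message): string
{
    // Digit-to-letter map inferred from the question's leet example (an assumption).
    // It also rewrites genuine numbers, so "7 pm" becomes "t pm".
    $text = str_replace(
        ['0', '1', '3', '4', '5', '7'],
        ['o', 'i', 'e', 'a', 's', 't'],
        strtolower($message)
    );

    // Join runs of three or more single letters separated by one space:
    // "h e l l" becomes "hell", while ordinary words are left alone.
    $joined = preg_replace_callback(
        '/\b[a-z](?: [a-z]){2,}\b/',
        function (array $match): string {
            return str_replace(' ', '', $match[0]);
        },
        $text
    );

    return $joined ?? $text; // fall back to the unjoined text if the regex fails
}

echo normalizeMessage('1 4l50 w4n7 70 1mpl3m3n7'), "\n"; // i also want to implement
echo normalizeMessage('h 3 l l'), "\n";                  // hell
```

The normalized string would then be passed to the word check in place of the raw message. Any such mapping trades missed variants for new false positives, so it is worth testing against real chat logs before relying on it.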