MySQL Full-text Search
======================
be
created from `CHAR', `VARCHAR', or `TEXT' columns at `CREATE TABLE'
time or added later with `ALTER TABLE' or `CREATE INDEX'. For large
datasets, it will be much faster to load your data into a table that
has no `FULLTEXT' index, then create the index with `ALTER TABLE' (or
`CREATE INDEX'). Loading data into a table that already has a
`FULLTEXT' index will be slower.
Full-text searching is performed with the `MATCH()' function.
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO articles VALUES
-> (NULL,'MySQL Tutorial', 'DBMS stands for DataBase ...'),
-> (NULL,'How To Use MySQL Efficiently', 'After you went through a ...'),
-> (NULL,'Optimising MySQL','In this tutorial we will show ...'),
-> (NULL,'1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> (NULL,'MySQL vs. YourSQL', 'In the following database comparison ...'),
-> (NULL,'MySQL Security', 'When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM articles
-> |
+----+-------------------+------------------------------------------+
| 5 | MySQL vs. YourSQL | In the following database comparison ... |
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
to
`AGAINST()'. The search is performed in case-insensitive fashion. For
every row in the table, `MATCH()' returns a relevance value, that is, a
similarity measure between the search string and the text in that row
in the columns named in the `MATCH()' list.
When `MATCH()' is used in a `WHERE' clause (see example above) the rows
returned are automatically sorted with highest relevance first.
Relevance values are non-negative floating-point numbers. Zero
relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents
(rows) that contain a particular word.
It is also possible to perform a boolean mode search. This is explained
later in the section.
The preceding example is a basic illustration showing how to use the
`MATCH()' function. Rows are returned in order of decreasing relevance.
The next example shows how to retrieve the relevance values explicitly.
As neither `WHERE' nor `ORDER BY' clauses are present, returned rows
are not ordered.
mysql>
| 1 | 0.64840710366884 |
| 2 | 0 |
| 3 | 0.66266459031789 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)
The following example is more complex. The query returns the relevance
and still sorts the rows in order of decreasing relevance. To achieve
this result, you should specify `MATCH()' twice. This will cause no
additional overhead, because the MySQL optimiser will notice that the
two `MATCH()' calls are identical and invoke the full-text search code
only once.
mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body | score |
+----+-------------------------------------+-----------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5055546709332 |
| 6 | When configured properly, MySQL ... | 1.31140957288 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
MySQL uses a very simple parser to split text into words. A "word" is
any sequence of characters consisting of letters, digits, `'', and `_'.
Any "word" that is present in the stopword list or is just too short
is ignored. The default minimum length of words that will be found by
full-text searches is four characters. This can be changed as described
in *Note Fulltext Fine-tuning::.
Every correct word in the collection and in the query is weighted
according to its significance in the query or collection. This way, a
word that is present in many documents will have lower weight (and may
even have a zero weight), because it has lower semantic value in this
particular collection. Otherwise, if the word is rare, it will receive
a higher weight. The weights of the words are then combined to compute
the relevance of the row.
may
sometimes produce bizarre results.
mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)
The search for the word `MySQL' produces no results in the above
example, because that word is present in more than half the rows. As
such, it is effectively treated as a stopword (that is, a word with
zero semantic value). This is the most desirable behaviour -- a
natural language query should not return every second row from a 1 GB
table.
A word that matches half of rows in a table is less likely to locate
relevant documents. In fact, it will most likely find plenty of
irrelevant documents. We all know this happens far too often when we
are trying to find something on the Internet with a search engine. It
is with this reasoning that such rows have been assigned a low semantic
value in *this particular dataset*.
As of Version 4.0.1, MySQL can also perform boolean full-text searches
using the `IN BOOLEAN MODE' modifier.
mysql> SELECT * FROM articles WHERE MATCH (title,body)
-> |
+----+------------------------------+-------------------------------------+
| 1 | MySQL Tutorial | DBMS stands for DataBase ... |
| 2 | How To Use MySQL Efficiently | After you went through a ... |
| 3 | Optimising MySQL | In this tutorial we will show ... |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
| 6 | MySQL Security | When configured properly, MySQL ... |
+----+------------------------------+-------------------------------------+
This query retrieved all the rows that contain the word `MySQL' (note:
the 50% threshold is not used), but that do *not* contain the word
`YourSQL'. Note that a boolean mode search does not automatically sort
rows in order of decreasing relevance. You can see this from result of
the preceding query, where the row with the highest relevance (the one
that contains `MySQL' twice) is listed last, not first. A boolean
full-text search can also work even without a `FULLTEXT' index,
although it would be *slow*.
The boolean full-text search capability supports the following
operators:
`+'
A leading plus sign indicates that this word *must be* present in
every row returned.
`-'
A leading minus sign indicates that this word *must not be*
present in any row returned.
`'
By default (when neither plus nor minus is specified) the word is
optional, but the rows that contain it will be rated higher. This
mimicks the behaviour of `MATCH() ... AGAINST()' without the `IN
BOOLEAN MODE' modifier.
`< >'
These two operators are used to change a word's contribution to the
relevance value that is assigned to a row. The `<' operator
decreases the contribution and the `>' operator increases it. See
the example below.
`( )'
Parentheses are used to group words into subexpressions.
`~'
A leading tilde acts as a negation operator, causing the word's
contribution to the row relevance to be negative. It's useful for
marking noise words. A row that contains such a word will be rated
lower than others, but will not be excluded altogether, as it
would be with the `-' operator.
`*'
An asterisk is the truncation operator. Unlike the other
operators, it should be *appended* to the word, not prepended.
`"'
The phrase, that is enclosed in double quotes `"', matches only
rows that contain this phrase *literally, as it was typed*.
And here are some examples:
`apple banana'
find rows that contain at least one of these words.
`+apple +juice'
... both words.
`+apple macintosh'
... word "apple", but rank it higher if it also contain
"macintosh".
`+apple -macintosh'
... word "apple" but not "macintosh".
`+apple +(>turnover na("Источник: Документация\n F7D5C0D093C5DDC4F5AFFEF6D78E8DCD93C0D68AC1DCD1D5DFE7F6F1F981CB8BCADBD48682DECF88DED4E9E8E8FFDFC6DB8CD7C48FC8D1D68AC8C3F9F5EFE08384DFD29CBDE9D8DAD9DFC1CFA9F6BCF3CDC1CBC18290D8DDCA8496C1C0B2E3E5A3DE84CBCCDAC08288C989C9D19AB3C7C6C9E396C0E2DA89D1D0CA8080D8C1FDA2A4ACE7C8EFF1ECE7ACCDD1DFC2C2DBEDE6E4F4CF8FD4C4DCC2C4CD9384998ADAFAEEE1F2F2DBC5C384DBC7C2CFC88D90D5E1E9EDEF9986CA9B8DC5D59280CAC5D191EDA0E5F9D6") more
control over full-text searching behaviour.
Note that full-text search was carefully tuned for the best searching
effectiveness. Modifying the default behaviour will, in most cases,
only make the search results worse. Do not alter the MySQL sources
unless you know what you are doing!
The full-text variables described in the following list must be set at
server startup time. You cannot modify them dynamically while the
server is running.
* The minimum length of words to be indexed is defined by the MySQL
variable `ft_min_word_len'. *Note `ft_min_word_len': SHOW
VARIABLES. (This variable is only available from MySQL version
4.0.) The default value is four characters. Change it to the
value you prefer, and rebuild your `FULLTEXT' indexes. For
example, if you want three-character words to be searchable, you
can set this variable by putting the following lines in an option
file:
[mysqld]
ft_min_word_len=3
Then restart the server and rebuild your `FULLTEXT' indexes.
* The stopword list can be loaded from the file specified by the
`ft_stopword_file' variable. *Note `ft_stopword_file': SHOW
VARIABLES. Rebuild your `FULLTEXT' indexes after modifying the
stopword list. (This variable is only available from MySQL
version 4.0.10 and onwards)
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in
this case. *Note*: by doing this you *severely* decrease MySQL's
ability to provide adequate relevance values for the `MATCH()'
function. If you really need to search for such common words, it
would be better to search using `IN BOOLEAN MODE' instead, which
does not observe the 50% threshold.
* Sometimes the search engine maintainer would like to change the
operators used for boolean full-text searches. These are defined
by the `ft_boolean_syntax' variable. *Note `ft_boolean_syntax':
SHOW VARIABLES. Still, this variable is read-only; its value is
set in `myisam/ft_static.c'.
For full-text changes that require you to rebuild your `FULLTEXT'
indexes, the easiest way to do so for a `MyISAM' table is to use the
following statement, which rebuilds the index file:
mysql> REPAIR TABLE tbl_name QUICK;
Full-text Search TODO
---------------------
* Make all operations with `FULLTEXT' index *faster*.
* Proximity operators
* Support for "always-index words". They could be any strings the
user wants to treat as words, examples are "C++", "AS/400",
"TCP/IP", etc.
* Support for full-text search in `MERGE' tables.
* Support for multi-byte charsets.
* Make stopword list to depend of the language of the data.
* Stemming (dependent of the language of the data, of course).
* Generic user-suppliable UDF preparser.
* Make the model more flexible (by adding some adjustable parameters
to `FULLTEXT' in `CREATE/ALTER TABLE').
[Назад] [Содержание] [Вперед]
| Главная |