I created two tables tag_subset and article_tag_subset to construct the idea subset for data mining. The overall tag attributes is 20,000+, and can not fully loaded into weka. Also most tag attributes are only used in several articles. Therefore selecting the popular tag attribute as the subset is more effecient for association rule mining process.
The tables in blue color stores the raw data extracted from xml file, and they were created in project 6. Table tag_subset holds the popular tag subset by counting the frequency of tag attributes in articles. Table article_tag_subset holds the subset article data object with the count of selected tag attributes. Below is the overall database schema.
In order to include as many data objects as possible, all the 2483 articles are loaded in the table article_tag_subset. Below is the statistic data for all articles data set.
total number of tags: | 21107 |
total number of articles: | 2483 |
number of tags in a article: | Min: 4 |
---|---|
Max: 30 | |
Mean: 21.975 | |
Stddev: 7.245 | |
frequency of tags in all articles: | Min: 1 |
Max: 2005 | |
Mean: 2.585 | |
Stddev: 18.613 |
I chose the top 100 popular tags out of 21107 for association mining. The popularity of the tag is measured by the frequency of the tag occuring in all articles. I sorted the popularity in the database by joining table articles, tags, article_tags. The subset of 100 popular tag is as below. Column name is the name of tag. Column count(idarticles) is the frequency of the tag.
name | idtags | count(idarticles) |
wikipedia | 217 | 2005 |
wiki | 225 | 1045 |
reference | 238 | 685 |
history | 211 | 437 |
research | 233 | 364 |
science | 357 | 315 |
programming | 745 | 238 |
culture | 374 | 236 |
people | 270 | 227 |
philosophy | 273 | 217 |
interesting | 561 | 211 |
article | 219 | 207 |
art | 420 | 197 |
politics | 678 | 183 |
design | 215 | 182 |
software | 434 | 172 |
language | 407 | 167 |
books | 268 | 165 |
psychology | 320 | 152 |
technology | 884 | 152 |
music | 236 | 150 |
theory | 323 | 149 |
math | 553 | 146 |
development | 743 | 146 |
computer | 404 | 145 |
religion | 699 | 138 |
literature | 276 | 134 |
education | 580 | 122 |
information | 552 | 117 |
business | 624 | 116 |
health | 336 | 113 |
mathematics | 554 | 105 |
writing | 271 | 102 |
definition | 423 | 102 |
fun | 535 | 100 |
internet | 617 | 99 |
inspiration | 723 | 99 |
encyclopedia | 1309 | 97 |
linux | 428 | 97 |
articles | 572 | 96 |
cool | 1484 | 95 |
web | 292 | 93 |
english | 509 | 93 |
economics | 514 | 92 |
travel | 1186 | 91 |
architecture | 889 | 90 |
reading | 468 | 89 |
physics | 1153 | 87 |
film | 632 | 86 |
network | 2682 | 82 |
biography | 563 | 81 |
video | 432 | 80 |
learning | 577 | 80 |
games | 397 | 80 |
algorithms | 2034 | 78 |
book | 373 | 78 |
- | 2140 | 77 |
security | 1109 | 77 |
tools | 251 | 76 |
opensource | 444 | 75 |
movies | 786 | 74 |
hardware | 252 | 73 |
words | 1654 | 71 |
work | 667 | 70 |
funny | 287 | 70 |
social | 891 | 70 |
management | 792 | 69 |
free | 972 | 68 |
usa | 239 | 68 |
info | 395 | 68 |
biology | 1066 | 67 |
game | 1770 | 67 |
ideas | 418 | 66 |
school | 719 | 66 |
tech | 817 | 66 |
list | 489 | 66 |
artist | 892 | 65 |
statistics | 518 | 65 |
algorithm | 1502 | 63 |
web2.0 | 1256 | 62 |
photography | 814 | 60 |
media | 886 | 60 |
future | 2262 | 60 |
food | 981 | 60 |
computers | 1735 | 59 |
mind | 622 | 58 |
read | 234 | 57 |
finance | 869 | 57 |
society | 501 | 57 |
humor | 237 | 56 |
movie | 855 | 56 |
windows | 1623 | 55 |
artists | 1341 | 55 |
computing | 294 | 55 |
thesis | 854 | 55 |
ai | 1265 | 54 |
of | 704 | 54 |
@wikipedia | 591 | 54 |
todo | 2443 | 53 |
brain | 1067 | 53 |
The table article_tag_subset then constructs its columns from the selected 100 tag attributes. Each data object has 100 tag attributes. The numeric value of attributes indicates the original count number of the tag. If the article does not have the tag, the value is 0. I developed a python script to extract the data from the related table. The main function is below.
def add_tag_column(self): try: if self.cursor.execute('select name,idtags from popular_tags'): rows=self.cursor.fetchall() for row in rows: try: self.admincursor.execute ('ALTER TABLE article_tag_subset ADD COLUMN `%s` INT(11) NOT NULL DEFAULT 0' % (row[0])); self.admincursor.execute ('update article_tag_subset, article_tags set %s=`count` where article_tag_subset.idarticles=article_tags.idarticles and idtags=%d' % (str(row[0]), int(row[1]))) self.adminconn.commit() except Exception, xx: traceback.print_exc() except Exception,x: traceback.print_exc() return -1
Here is an data object from table article_tag_subset. As you will see, only a few attributes are present in the data object, the rest attributes are non-present.
idarticle_tag_subset | idarticles | wikipedia | wiki | reference | history | research | science | programming | culture | people | philosophy | interesting | article | art | politics | design | software | language | books | psychology | technology | music | theory | development | math | computer | religion | literature | education | information | business | health | mathematics | writing | definition | fun | inspiration | internet | linux | encyclopedia | articles | cool | web | english | economics | travel | architecture | reading | physics | film | network | biography | video | games | learning | algorithms | book | - | security | tools | opensource | movies | hardware | words | work | funny | social | management | free | usa | info | biology | game | ideas | school | tech | list | artist | statistics | algorithm | web2.0 | photography | media | future | food | computers | mind | read | finance | society | humor | movie | computing | thesis | artists | windows | of | @wikipedia | ai | brain | todo |
1 | 15 | 3 | 1 | 0 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |