new functionalities for High Dimensionality problem and improved performance #19
Open
jjaranda13 wants to merge 17 commits into exhuma:master from jjaranda13:master
17 commits
7cec18a
Update README.rst
2252ead
Update README.rst
f1aeaba
Update README.rst
fe70bc2
Update README.rst
7223010
Update README.rst
25ea0d3
Update README.rst
f4d416c
Update AUTHORS
56f7da5
Update util.py
4dad547
Update kmeans.py
55ae158
Update util.py
2ae8e4b
Create HDdistances.py
ee2e62c
Create HDexample.py
f275e9d
Update HDexample.py
4b58056
Update HDdistances.py
juanrd0088 85e88d7
Merge pull request #1 from juanrd0088/master
4340c79
solved pull request issues
5d818c8
exuma request
AUTHORS
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,6 @@ | ||
| Michel Albert (exhuma@users.sourceforge.net) | ||
| Sam Sandberg (@LoisaidaSam) | ||
| Sam Sandberg (@LoisaidaSam) | ||
|
|
||
| high dimensionality functionalities: | ||
| Jose J. GarciaAranda (@jjaranda13) | ||
| Juan Ramos Diaz (@juanrd0088) |
HDexample.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # -*- coding: cp1252 -*- | ||
| ############################################################################### | ||
| # High Dimensionality problem example | ||
| # Authors: | ||
| # 2015 Jose Javier Garcia Aranda , Juan Ramos Diaz | ||
| # | ||
| ############################################################################### | ||
| # This High Dimensionality example creates N items (which are "users"). | ||
| # Each user is defined by his profile. | ||
| # A profile is a tuple of 10 pairs of keyword and weight ( 20 fields in total) | ||
| # weights are floating numbers and belong to 0..1 | ||
| # The summation of weights of a profile is normalized to 1 | ||
| # we consider 1000 different keywords | ||
| # A profile takes 8 keywords from the first 300 keywords (the "popular" keywords) and 2 from the remaining 700 | ||
| # Each keyword is a dimension. Therefore there are 1000 possible dimensions | ||
| # A single user only has 10 dimensions | ||
| # Different users can have different dimensions. | ||
| # A new distance and equality function are defined for this use case | ||
| # | ||
| # cl = KMeansClustering(users,HDdistItems,HDequals); | ||
| # | ||
| # Additionally, the number of iterations can now be limited in order to save time | ||
| # Experimentally, we have concluded that 10 iterations are accurate enough for most cases. | ||
| # The new HDgetclusters() function is linear: it avoids recalculating the centroids, | ||
| # whereas the original getClusters() function has N*N complexity because it recalculates the | ||
| # centroid whenever an item moves from one cluster to another. | ||
| # This new function can be used for low and high dimensionality problems, increasing | ||
| # performance in both cases | ||
| # | ||
| # solution = cl.HDgetclusters(numclusters,max_iterations); | ||
| # | ||
| # Another optimization, available inside the HDcentroid() function, is the use of the mean instead of the median for the centroid calculation. | ||
| # The median is more accurate but involves more computation when N is huge. | ||
| # The function HDcentroid() is invoked internally by HDgetclusters() | ||
| # | ||
| # The optional invocation of HDcomputeSSE() assists the computation of the optimal number of clusters. | ||
| # | ||
| # | ||
| from __future__ import print_function | ||
| from cluster import KMeansClustering | ||
| from cluster import ClusteringError | ||
| from cluster import util | ||
| from cluster.util import HDcentroid | ||
| from cluster.HDdistances import HDdistItems, HDequals, HDcomputeSSE, HD_profile_dimensions | ||
|
|
||
| import time | ||
| import datetime | ||
| import random | ||
|
|
||
| def createProfile(): | ||
| """create a profile composed of 10 dimensions chosen from 1000 dimensions""" | ||
| num_words=1000 | ||
| total_weight=0; | ||
| marked_word=[0]*num_words | ||
| repeated_word=False | ||
| list_profile=[] | ||
| returned_profile=(); | ||
| profile_aux=[]; | ||
| #10 pairs word, weight. | ||
| HD_profile_dimensions=10 | ||
| #Do not repeat words. | ||
| for i in range(8): | ||
| partial_weight=random.uniform(0,1) | ||
| total_weight+=partial_weight | ||
| repeated_word=False | ||
| while repeated_word==False: | ||
| random_word=random.randint(0,299) | ||
| if marked_word[random_word]==0: | ||
| marked_word[random_word]=1 | ||
| repeated_word=True | ||
| random_word= str(random_word) | ||
| tupla=[random_word,partial_weight] | ||
| list_profile.append(tupla) | ||
| for i in range(2): | ||
| partial_weight=random.uniform(0,1) | ||
| total_weight+=partial_weight | ||
| repeated_word=False | ||
| while repeated_word==False: | ||
| random_word=random.randint(300,999) | ||
| if marked_word[random_word]==0: | ||
| marked_word[random_word]=1 | ||
| repeated_word=True | ||
| random_word= str(random_word) | ||
| tupla=[random_word,partial_weight] | ||
| list_profile.append(tupla) | ||
| #Normalization of the profile | ||
| for i in range(len(list_profile)): | ||
| a=list_profile[i][0] | ||
| b=list_profile[i][1] | ||
| b=b/total_weight; #the sum of the weights must be 1 | ||
| profile_aux=([a,b]) | ||
| returned_profile+=tuple(profile_aux) | ||
| return returned_profile | ||
|
|
||
| #################################################### | ||
| # MAIN # | ||
| #################################################### | ||
| sses=[0]*10 #stores the sse metric for each number of clusters from 5 to 50 | ||
| num_users=100 | ||
| numsse=0 | ||
| numclusters=5 # starts at 5 | ||
| max_iterations=10 | ||
| start_time=datetime.datetime.now() | ||
| while numclusters<=50: # compute SSE from num_clusters=5 to 50 | ||
| users=[] # users are the items of this example | ||
| for i in range(num_users): | ||
| user = createProfile() | ||
| users.append(user) | ||
| print (" initializing kmeans...") | ||
| cl = KMeansClustering(users,HDdistItems,HDequals); | ||
| print (" executing...",numclusters) | ||
| st=datetime.datetime.now() | ||
| print (st) | ||
| numclusters=numclusters | ||
| solution = cl.HDgetclusters(numclusters,max_iterations); | ||
| for i in range(numclusters): | ||
| a = solution[i] | ||
| print (util.HDcentroid(a),",") | ||
| st=datetime.datetime.now() | ||
|
|
||
| sses[numsse]=HDcomputeSSE(solution,numclusters) | ||
| numsse+=1 | ||
| numclusters+=5 | ||
| end_time=datetime.datetime.now() | ||
| print ("start_time:",start_time) | ||
| print ("end_time:",end_time) | ||
| print ("sses:",sses) | ||
|
|
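The profile generation described in the header comment of the example can be sketched more compactly. This is a minimal sketch, not the code in the pull request: the names `create_profile`, `NUM_KEYWORDS`, `POPULAR_KEYWORDS` and `PROFILE_DIMENSIONS` are hypothetical, and the popular range of 300 keywords is taken from the `random.randint(0,299)` call in the diff rather than from the comment. It uses `random.sample` to avoid repeated keywords and normalizes every weight, so the weights of all 10 pairs sum to 1.

```python
import random

# Hypothetical constants mirroring the values used in the example script.
NUM_KEYWORDS = 1000      # total keyword vocabulary
POPULAR_KEYWORDS = 300   # the diff draws "popular" keywords from ids 0..299
PROFILE_DIMENSIONS = 10  # keyword/weight pairs per profile


def create_profile(rng=random, popular=8):
    """Return a flat tuple (kw1, w1, kw2, w2, ...) whose weights sum to 1."""
    rare = PROFILE_DIMENSIONS - popular
    # sample() draws without replacement, so no keyword is repeated.
    keywords = rng.sample(range(POPULAR_KEYWORDS), popular)
    keywords += rng.sample(range(POPULAR_KEYWORDS, NUM_KEYWORDS), rare)
    weights = [rng.uniform(0, 1) for _ in keywords]
    total = sum(weights)
    profile = []
    for keyword, weight in zip(keywords, weights):
        # Normalize every pair, not just a prefix of the list.
        profile.extend([str(keyword), weight / total])
    return tuple(profile)
```

Calling `create_profile()` yields a 20-field tuple in the same flat layout that `HDdistItems` expects.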
HDdistances.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
|
|
||
| """ This file provides functionalities for High dimensionality problems but also for low dimensionality problems | ||
|
|
||
| added functionalities: | ||
| - New Distance computation | ||
| - SSE metric computation to assist the computation of the optimal number of clusters | ||
|
|
||
| Authors: | ||
| Jose Javier Garcia Aranda | ||
| Juan Ramos Diaz | ||
| """ | ||
| from . import util | ||
| import time | ||
| import datetime | ||
| import random | ||
|
|
||
| HD_profile_dimensions=10 #dimensions per profile, default value is 10 | ||
|
|
||
| def HDdistItems(profile1,profile2): | ||
| """Distance function, this distance between two profiles is defined as: | ||
| For each keyword of user A, if the keyword is not present in user B, then the distance for this keyword is the weight in user A. | ||
| If the keyword exists in both users, the weights are compared and the distance is the absolute difference. | ||
| For each keyword present in the union of keywords of both profiles, the distance is computed and added to the total distance between both users | ||
| """ | ||
|
|
||
| len1=len(profile1)//2 # len(profile1) is always even because each dimension has a weight | ||
| len2=len(profile2)//2 # len(profile2) is always even because each dimension has a weight | ||
| total_len=len1+len2 #this value is usually 20 | ||
| #factor_len=20.0/total_len #this only works if the profile has fewer than 10 keys | ||
| factor_len=2.0*HD_profile_dimensions/total_len #this only works if the profile has fewer than 10 keys | ||
| distance = 0.0 | ||
| marked=[0]*(total_len*2); | ||
| for i in range(len1): | ||
| found=False | ||
| for j in range(len2): | ||
| if profile1[i*2]==profile2[j*2]: | ||
| distance+=abs(profile1[i*2+1]-profile2[j*2+1]); | ||
| found=True; | ||
| marked[j*2]=1; | ||
| break; | ||
| if found==False: | ||
| distance+=profile1[i*2+1]; | ||
|
|
||
| for i in range(len2): | ||
| if marked[i*2]==1: | ||
| continue; | ||
| distance+=profile2[i*2+1] | ||
|
|
||
| distance=distance*factor_len | ||
| return distance | ||
|
|
||
| def HDequals(profile1,profile2): | ||
| for i in range(HD_profile_dimensions): | ||
| if profile1[i*2]!=profile2[i*2]: | ||
| return False | ||
| elif profile1[i*2+1]!=profile2[i*2+1]: | ||
| return False | ||
| return True | ||
|
|
||
|
|
||
| def HDcomputeSSE(solution,numclusters): | ||
| """This metric measures the cohesion of users within a cluster and the separation among clusters at the same time""" | ||
|
|
||
| partial_solution=0 | ||
| total_solution=0 | ||
| dist=0 | ||
| for i in range(numclusters): | ||
| partial_solution=0 | ||
| for j in solution[i]: | ||
| dist=HDdistItems(util.HDcentroid(solution[i]),j) | ||
| partial_solution+=dist*dist | ||
| total_solution+=partial_solution | ||
| return total_solution |
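The distance described in the `HDdistItems` docstring, along with the equality check and the SSE sum, can also be expressed with dictionaries instead of nested index loops. The following is a sketch under stated assumptions, not the pull request's implementation: `profile_to_dict`, `hd_distance`, `hd_equals`, `sse` and the `centroid_of` parameter are hypothetical names, keywords are assumed to be unique within a profile, weights are assumed non-negative, and `hd_equals` treats two profiles as equal regardless of the order of their pairs, which may differ from the intended semantics of `HDequals`.

```python
HD_PROFILE_DIMENSIONS = 10  # mirrors HD_profile_dimensions in the diff


def profile_to_dict(profile):
    """Turn a flat (kw1, w1, kw2, w2, ...) tuple into {keyword: weight}."""
    return dict(zip(profile[0::2], profile[1::2]))


def hd_distance(profile1, profile2):
    """Sum |w1 - w2| over the union of keywords; a missing keyword counts as 0."""
    weights1 = profile_to_dict(profile1)
    weights2 = profile_to_dict(profile2)
    distance = 0.0
    for keyword in set(weights1) | set(weights2):
        distance += abs(weights1.get(keyword, 0.0) - weights2.get(keyword, 0.0))
    # Same length-based scaling as factor_len in HDdistItems.
    # Assumes at least one profile is non-empty.
    factor = 2.0 * HD_PROFILE_DIMENSIONS / (len(weights1) + len(weights2))
    return distance * factor


def hd_equals(profile1, profile2):
    """Order-insensitive equality: same keywords with the same weights."""
    return profile_to_dict(profile1) == profile_to_dict(profile2)


def sse(clusters, centroid_of):
    """Sum of squared distances from each item to its cluster centroid.

    centroid_of is a hypothetical callable standing in for util.HDcentroid.
    """
    total = 0.0
    for cluster in clusters:
        centre = centroid_of(cluster)  # computed once per cluster
        total += sum(hd_distance(centre, item) ** 2 for item in cluster)
    return total
```

Each lookup in the dictionary is constant time on average, so the distance costs time proportional to the number of keywords in the union rather than to the product of the two profile lengths. Computing the centroid once per cluster, rather than once per item, likewise avoids repeated work in the SSE loop.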