Manual and Automatic Clustering with Thesaurus Generation
By default, data is clustered in its natural order upon ingestion. However, you can impose an explicit clustering order by defining a clustering key. For a table with an explicit clustering key, reclustering can be triggered in two ways: manual clustering and automatic clustering.
What is Manual Clustering?
Manual clustering means that a user reclusters the data in a clustered table, on a user-specified warehouse, by running the ALTER TABLE command.
For example, to recluster the whole table:
alter table t1 recluster;
To recluster only part of the table, add a filter:
alter table t2 recluster where create_date between ('2016-01-01') and ('2016-01-07');
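When a large date range has to be reclustered in pieces, the filtered statement above can be generated per window. The following is a minimal sketch, not a Snowflake utility: the function name recluster_statements and the seven-day default window are assumptions made for illustration, and the code only builds statement text in the same form as the example above without executing anything.

```python
from datetime import date, timedelta


def recluster_statements(table, column, start, end, days=7):
    """Return one ALTER TABLE ... RECLUSTER statement per date window.

    Hypothetical helper: windows are inclusive and `days` long,
    with the last window trimmed so it never runs past `end`.
    """
    statements = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=days - 1), end)
        statements.append(
            f"alter table {table} recluster where {column} "
            f"between ('{window_start.isoformat()}') and ('{window_end.isoformat()}');"
        )
        # The next window starts the day after this one ends.
        window_start = window_end + timedelta(days=1)
    return statements


for stmt in recluster_statements("t2", "create_date", date(2016, 1, 1), date(2016, 1, 14)):
    print(stmt)
```

Each generated statement can then be run on the chosen warehouse one at a time.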
The ALTER TABLE RECLUSTER command needs a user warehouse, and manual reclustering is billed through the warehouse used for the reclustering. The disadvantage is that the end user has to control the warehouse size, which may turn out to be too small or too large. As a rule of thumb, use the following formula to calculate the number of nodes/servers for the warehouse.
Num of Nodes < table_size in GB / 30
That is, the number of nodes/servers should be less than the table size in GB divided by 30.
For example, the table FACT_SLICES is about 133 GB (compressed), so the number of nodes should be less than 133/30 ≈ 4.4. A Medium warehouse is therefore the largest you should use; Small or X-Small might be enough.
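The rule of thumb can be worked through in a few lines. This is a minimal sketch under stated assumptions: the node counts per warehouse size (1, 2, 4, 8, 16) follow the usual doubling per size step and are not given in this article, and the names max_nodes and largest_warehouse are hypothetical.

```python
# Assumed node counts per warehouse size (doubling per step);
# these numbers are an assumption, not taken from this article.
WAREHOUSE_NODES = [
    ("X-Small", 1),
    ("Small", 2),
    ("Medium", 4),
    ("Large", 8),
    ("X-Large", 16),
]


def max_nodes(table_size_gb):
    """Upper bound from the rule of thumb: nodes < table size in GB / 30."""
    return table_size_gb / 30


def largest_warehouse(table_size_gb):
    """Return the largest size whose node count stays below the bound.

    Falls back to the smallest size when even one node exceeds the bound.
    """
    limit = max_nodes(table_size_gb)
    fitting = [name for name, nodes in WAREHOUSE_NODES if nodes < limit]
    return fitting[-1] if fitting else WAREHOUSE_NODES[0][0]


print(round(max_nodes(133), 1))   # bound for the 133 GB FACT_SLICES example
print(largest_warehouse(133))     # largest size that stays under the bound
```

The result is only a ceiling; as noted above, a smaller warehouse may be enough.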
What is Automatic Clustering?
Automatic Clustering is the Snowflake service that automatically and continuously manages all reclustering of clustered tables as needed, replacing the manual reclustering described in the previous section.
To enable automatic clustering, you just need to define a clustering key on the table. For example,
create or replace table t1 (c1 date, c2 string, c3 number) cluster by (c1, c2);
-- to add a clustering key after the table is created:
alter table t1 cluster by (c1, c2);
You can also suspend or resume clustering using ALTER TABLE commands.
alter table t1 suspend recluster;
alter table t1 resume recluster;
What Do Manual and Automatic Clustering Have in Common?
Manual and automatic clustering have a lot in common. Here is the list.
- Both apply to tables with an explicitly defined clustering key
- Both require you to design the clustering keys
- Both incur compute and storage costs
- Both serve the same purpose
The purpose of both manual and automatic clustering is to improve query performance.
What are the Differences between Manual Clustering and Automatic Clustering?
Though there are many things in common between manual and automatic clustering, there are also many differences between them. The major differences are summarized in the table below.
Thesaurus (information retrieval)
In the context of information retrieval, a thesaurus (plural: “thesauri”) is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimize semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as “any item that is to be described for inclusion in an information retrieval system, website, or other source of information”. The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object.
- A thesaurus serves to guide both an indexer and a searcher in selecting the same preferred term or combination of preferred terms to represent a given subject. ISO 25964, the international standard for information retrieval thesauri, defines a thesaurus as a “controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms.”
- A thesaurus is composed of at least three elements: (1) a list of words (or terms); (2) the relationships among the words (or terms), indicated by their hierarchical relative position (e.g. parent/broader term, child/narrower term, synonym, etc.); and (3) a set of rules on how to use the thesaurus.
- In information retrieval, a thesaurus can be used as a form of controlled vocabulary to aid in the indexing of appropriate metadata for information bearing entities. A thesaurus helps with expressing the manifestations of a concept in a prescribed way, to aid in improving precision and recall.
- This means that the semantic conceptual expressions of information bearing entities are easier to locate due to uniformity of language. Additionally, a thesaurus is used for maintaining a hierarchical listing of terms, usually single words or bound phrases, that aid the indexer in narrowing the terms and limiting semantic ambiguity.
- The Art & Architecture Thesaurus, for example, is used by countless museums around the world to catalogue their collections. AGROVOC, the thesaurus of the UN’s Food and Agriculture Organization, is used to index and/or search its AGRIS database of worldwide literature on agricultural research.
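The elements described above (a list of terms, broader/narrower relationships, and lead-in entries that point to a preferred term) can be sketched as a small data structure. This is an illustrative sketch only, not the model prescribed by ISO 25964 or used by any particular thesaurus: the class names Concept and Thesaurus, the method names, and the sample terms are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    preferred: str                                 # the preferred term
    lead_ins: set = field(default_factory=set)     # synonyms / quasi-synonyms
    broader: Optional[str] = None                  # parent (broader) term
    narrower: list = field(default_factory=list)   # child (narrower) terms


class Thesaurus:
    """Hypothetical minimal thesaurus: terms, hierarchy, and one usage rule."""

    def __init__(self):
        self.concepts = {}   # preferred term -> Concept
        self._lookup = {}    # any known term (lowercased) -> preferred term

    def add(self, preferred, lead_ins=(), broader=None):
        # Usage rule assumed here: a broader term must be added first.
        concept = Concept(preferred, set(lead_ins), broader)
        self.concepts[preferred] = concept
        for term in (preferred, *lead_ins):
            self._lookup[term.lower()] = preferred
        if broader is not None:
            self.concepts[broader].narrower.append(preferred)

    def preferred_term(self, term):
        """Map an indexer's or searcher's term to the preferred term, if known."""
        return self._lookup.get(term.lower())


# Made-up sample entries, for illustration only.
thesaurus = Thesaurus()
thesaurus.add("furniture")
thesaurus.add("chairs", lead_ins=["seats"], broader="furniture")
print(thesaurus.preferred_term("Seats"))
print(thesaurus.concepts["furniture"].narrower)
```

Because indexer and searcher both resolve their wording through the same lookup, they arrive at the same preferred term, which is the uniformity the text credits with improving precision and recall.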