GeneWeaver Documentation

GeneWeaver is an online system for the integration of functional genomics experiments. The GeneWeaver web-based software system contains a database of experimental results and a set of interactive tools for analysis and visualization. The database provides storage of functional genomics experiments in the form of gene sets, gene set descriptions and gene set association scores from multiple species. Currently 10 species are available: Mus musculus, Homo sapiens, Rattus norvegicus, Danio rerio, Drosophila melanogaster, Macaca mulatta, Caenorhabditis elegans, Saccharomyces cerevisiae, Gallus gallus and Canis familiaris.

GeneWeaver allows users to integrate diverse gene sets across species, tissue and experimental platform. Sets can be stored, shared and compared privately, among user defined groups of investigators, and across all users. Gene sets can come from many different sources, including but not limited to: Microarray expression studies, Gene Ontology annotations, Text Mining tools, conserved modules, candidate genes, or even just a list of genes that you find interesting. The gene sets are connected at a gene level, using homology to relate gene sets from different species into an integrated database. This database stores all the gene sets uploaded, and allows users to share the data with others, or restrict data to collaborative groups only. The database also provides a single integrated data source for comparative genomics tools.
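
As a rough illustration of how homology can connect gene sets from different species at the gene level, the sketch below translates each gene into a shared homology cluster ID before comparing sets. The `HOMOLOGY` table, its cluster IDs and the `to_clusters` helper are hypothetical and do not reflect GeneWeaver's actual schema or API.

```python
# Hypothetical homology table: (species, gene symbol) -> homology cluster ID.
# The cluster IDs are made up for illustration only.
HOMOLOGY = {
    ("Mus musculus", "Drd2"): "HOM:1",
    ("Homo sapiens", "DRD2"): "HOM:1",
    ("Mus musculus", "Bdnf"): "HOM:2",
    ("Homo sapiens", "BDNF"): "HOM:2",
    ("Homo sapiens", "COMT"): "HOM:3",
}


def to_clusters(species, genes):
    """Translate a species-specific gene list into a set of homology cluster IDs."""
    return {HOMOLOGY[(species, g)] for g in genes if (species, g) in HOMOLOGY}


mouse_set = to_clusters("Mus musculus", ["Drd2", "Bdnf"])
human_set = to_clusters("Homo sapiens", ["DRD2", "COMT"])

# Genes shared across species are those whose clusters appear in both sets.
shared = mouse_set & human_set
print(shared)
```

Once both sets are expressed in the same cluster vocabulary, ordinary set operations apply regardless of the species each gene set came from.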

PDF Version of the GeneWeaver 2.0 documentation.

Getting Started

Uploading Gene Sets

While GeneWeaver contains over 30,000 publicly available gene sets, it is often the case that individual gene sets of interest are not among these. Registered users can upload Single Gene Sets and make use of the Batch Gene Set Upload process.

Users and Groups

GeneWeaver is available without registration to enable all users to search the database and analyze gene sets. Registered users can access several additional features, including long-term storage of gene sets, projects and results. Registered users can also form groups, designate administrators and share gene sets, projects and results with the members of their user group.

From your accounts page find the Manage Groups section. Here you can select the appropriate icons to:

Selecting the Join Public Group icon at the bottom of this section displays a modal that allows you to join one of many publicly available groups.

Curation

Controlling the quality and validity of the large-scale analysis of secondary data requires the enforcement of interpretable standards for gene set construction and description. GeneWeaver’s use of discrete analysis eliminates many barriers to the integration of heterogeneous data sets across species and experiments. However, it is important for users to be able to rapidly interpret the nature of gene sets retrieved from the site, requiring a minimal standard for metadata associated with secondary data. For this purpose, both unstructured textual descriptions of the data and structured ontology annotations to the terms in these descriptions are used to define gene sets.

Our Curation Standards provide detailed guidance to GeneWeaver curation policies and sample curation types. We have also included a brief explanation of the Curation Process, which includes a guide to our new curation interface.

Curation Standards Documentation

Secondary functional genomics data consists of the results of analyzed experiments in functional genomics. In contrast to primary data stores such as Gene Expression Omnibus (GEO) in which raw experimental data are stored, a secondary data store attempts to collect the results of experimental design and decision making process of the researcher so that one may interpret and integrate the gene set centered outcomes of the studies. Controlling the quality and validity of the large-scale analysis of secondary data requires the enforcement of interpretable standards for gene set construction and description. GeneWeaver’s use of discrete analysis eliminates many barriers to the integration of heterogeneous data sets across species and experiments. However, it is important for users to be able to rapidly interpret the nature of gene sets retrieved from the site, requiring a minimal standard for metadata associated with secondary data. For this purpose, both unstructured textual descriptions of the data and structured ontology annotations to the terms in these descriptions are used to define gene sets. In the interest of encouraging submission we are cautious not to be too prescriptive or burdensome to users, but rather to provide guidelines on standards used by internal curators to assess data quality and clarity to enable rapid acceptance of community submissions to the data repository.

Curation Tiers

| Tier | Name | Curator | Description |
| --- | --- | --- | --- |
| Tier I | Public Resource Grade | Resource, GeneWeaver | Large data sets primarily curated by their parent resource. GeneWeaver ensures consistency of metadata (gene annotations to KEGG, MP and GO, curated functional associations in the Neuroinformatics Framework, Comparative Toxicogenomics Database). |
| Tier II | Machine-Generated from public sources | GeneWeaver | Gene sets resulting from genome analysis, not otherwise published in total, e.g. gene co-expression to behavior from GeneNetwork.org, QTL positional candidates from MGI. GeneWeaver curators examine data and metadata. |
| Tier III | Human-Curated | GeneWeaver | Curated user-deposited data and publication supplements in domains of interest. |
| Tier IV | Submitted to Public-Provisional | User | User-deposited data made available to the public. All Tier IV data is examined for promotion to Tier III. |
| Tier V | Private User and Group data, Uncurated | User | Data sets deposited for private or group-only analysis. |

| Tier Name | Tier Description |
| --- | --- |
| Tier I: Public Resource Data | Tier I data are professionally curated into another major database and are imported into GeneWeaver, which ensures consistency of metadata. Resource grade data is updated on a six-month cycle. These include: gene annotations to KEGG, MP and GO, curated functional associations in Neuroinformatics Framework, and Comparative Toxicogenomics Database. |
| Tier II: Machine-Generated from public sources | Tier II data are computationally generated from data in public sources. These include empirical data obtained from public sources and their associated analytical tools, e.g. bulk analysis of gene co-expression to phenotypes across mouse strains from GeneNetwork.org, or QTL positional candidates from MGI. In contrast to Tier I, in which the individual gene annotations to function are manually curated, Tier II includes machine-generated gene annotations to functions from curated experimental data. GeneWeaver curators examine data and metadata. |
| Tier III: Human-Curated Data | Tier III data are directly entered or reviewed by a professional curator for redundancy with existing records and adherence to documentation standards. Users who submit data under Tier IV have the option of sharing their data to the public. These data will be marked provisional until reviewed by the curator for data entry errors, compliance with metadata standards and redundancy with existing data. The submitter of the data will have the opportunity to approve the curator's modifications prior to upgrade to Tier III status. For some research areas, a professional curator has identified and entered gene expression, quantitative trait locus and genome-wide association studies (GWAS). Where possible, the curator has obtained results directly from the study authors, supplements, or data repositories such as GEO, in addition to the often highly filtered set of results reported in publications. |
| Tier IV: Submitted to Public-Provisional | Tier IV consists of user-submitted data that has been shared to the public prior to review. This data is indicated as provisional, but can be used in all analyses. Curatorial review is required to remove the provisional label. |
| Tier V: Private User and Group Data, Uncurated | Data in user accounts that is assigned private or group-level access is confidential, is not exposed to analyses by users outside of the group with whom it is shared, and is therefore not reviewed by the professional curator. |

General Definitions

Gene Set Name: A brief title for the gene set, approximately sentence length, that should provide a clear and concise description of the contents of a gene set interpretable to most users of GeneWeaver, but with sufficient detail to satisfy a domain expert.  This is the major gene set name that is displayed in all search results, project directory and table views of analysis results. Standards for specific gene set types are given in the following section.

Gene Set Figure Label: A brief 23 character abbreviation to facilitate recognition of the gene set in a graph or other display.

Gene Set Description: A detailed description of the gene set, including rules for its construction, experimental methods and analyses used to generate data, anatomical terms, and traceable references to source data including accession information and date. Abbreviations should be avoided.

Ontology Annotations: Relevant terms from Disease Ontology, Mammalian Phenotype Ontology and other OBO ontologies, supplied by curators or identified through the application of the NCBO Annotator to textual descriptions, including publication abstracts.

Publication Information: PubMed ID, Title, authors, publication information and full-text of the abstract.

Standards for Common Gene Set Types

Gene Expression

Gene Set Name: Genes [upregulated/downregulated/differentially expressed] in [tissue] [comparison]. Example: Genes differentially expressed in striatum of C57BL/6J compared to C57BL/6C. Note: spell out anatomical terms as nouns, e.g. striatum, not striatal. Include complete strain names, e.g. C57BL/6J not B6.

Gene Set Figure Label: B6JvsB6CStriatum

Gene Set Description: Indicate which samples were compared and what experimental manipulations or tissue differences are being examined. Indicate the statistical methodology, the significance thresholds and which changes are reported here. Indicate whether the uploaded score is a p-value, q-value, effect size or fold change, and give the fold change reference. Example: Striatum gene expression differences between naive C57BL/6J and C57BL/6C substrains corresponding to a 5% FDR. A small number of genes are highly differentially expressed between B6 substrains, C57BL/6J (high alcohol consumption preference) and C57BL/6C (low alcohol consumption preference). Fold expression changes are relative to B6/J.

Gene Set Contents: Gene identifier and statistical score for differential expression, e.g. p-value, q-value, correlation coefficient, binary score, effect size or fold change.

Type of Data: Differential Expression Profiling

Gene Set Name: Description (name, Published QTL Chr # MGI:#). Example: cocaine related behavior 10 (Cocrb10, Published QTL Chr #)

Gene Set Figure Label: (QTL-name-Organism-Chr #). Example: QTL-Cocrb10-Mouse-Chr 9

Gene Set Description: QTL Name Definition, candidate gene selection method (e.g. 1.5  LOD drop; inter-marker interval). Exact description of phenotype. Strains used for mapping should be included. Example: Rats were subjected to a forced swim test (FST) procedure in which they are placed in water for 5 min, and their behavior was scored every 5 sec as immobility, climbing, or swimming. Data were analyzed for each activity with consideration given to their non-independence. p-value:0.0002, Variance: 3.6, Peak Marker: D5Rat40 (BLAT 16538053) Spans 1-41538053. This interval was obtained by using a fixed interval width of 25 Mbp around the peak marker. Strains were WKY/NHsd and F344/NHsd. Also defined as Imm3.

Gene Set Contents: Gene identifier and binary score.

Type of Data: Published QTL Candidate Gene List

Gene Set Name: Describe tissue and phenotype correlated. Example: Cerebellum gene expression correlates of acetic acid writhing behavior in BXD recombinant inbred mice.

Gene Set Figure Label: Co-expression writhing

Gene Set Description: Indicate which comparison was made and any statistical cut-offs that were used. Example: Cerebellum gene co-expression with acetic acid writhing in BXD RI mice. Gene expression data was obtained from genenetwork.org SJUT Cerebellum mRNA M430 (Mar05) RMA data set. Behavioral phenotype data was collected by RMQ and consisted of the number of writhes in response to 0.6% acetic acid i.p.

Gene Set Contents: Gene identifier and statistical score for co-expression, e.g. R-squared, p-value, q-value, binary threshold.

Type of Data: Co-Expression to Phenotype

Gene Set Name: Term # and name. Example: MP:XXXXXXX Abnormal.

Gene Set Figure Label: Term #. Example: Term #

Gene Set Description: Term Definition. Example: “Increase in the dose or concentration of a foreign compound required to induce a specific level of response” www.informatics.jax.org, 2010-12-01

Gene Set Contents: All gene sets include genes, mutant alleles or gene products annotated to an ontology term by a professional curator. Each gene directly annotated to the term is given a score of 1, each gene connected to a term through annotations to its higher order parents is given a score of 2. To use only direct annotations in an analysis assign a threshold of < 2 to each Gene Set.
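
The thresholding rule above can be sketched in a few lines. This is a minimal illustration that assumes a gene set is held as a mapping from gene identifier to score; the `direct_annotations` helper and the gene identifiers are hypothetical and are not part of GeneWeaver.

```python
def direct_annotations(gene_scores, threshold=2):
    """Keep only genes whose score is strictly below the threshold."""
    return {gene for gene, score in gene_scores.items() if score < threshold}


# Hypothetical gene set contents: gene identifier -> annotation score.
# 1 = annotated directly to the term, 2 = connected through a parent term.
gene_scores = {"GeneA": 1, "GeneB": 2, "GeneC": 1}

direct = direct_annotations(gene_scores)
print(sorted(direct))  # ['GeneA', 'GeneC']
```

With the threshold of `< 2`, only genes scored 1 remain, which corresponds to using direct annotations only.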

Type of Data: Reference Ontology

Gene Set Name: Co-Expression clusters. Example: Co-expression cluster of nicotine Dependence genes significantly expressed in the adolescent PFC, VS and Hippocampus.

Gene Set Figure Label: Abbreviated description. Example: Adolesc Rat Nic Dependence

Gene Set Description: Indicate what samples were compared and what was clustered. Example: Studies analyzing brain samples from female rats that had been injected with nicotine at four different ages show that nicotine exerts the greatest influence during adolescence. Using DNA microarrays, gene expression correlates were obtained from the prefrontal cortex (PFC), ventral striatum (VS), and hippocampus. Principal cluster analysis was then used to identify 76 genes that changed significantly in at least one of these three brain regions during the experiment.

Gene Set Contents: Gene identifier and statistical score for cluster analysis or binary threshold.

Type of Data: Co-Expression Clusters

Gene Set Name: GWAS of ... Example: GWAS of Alcohol and Nicotine Dependence in Australian DNA-Pools.

Gene Set Figure Label: Abbreviated description. Example: GWAS Alcohol Nicotine

Gene Set Description: List of positional candidate genes after correcting for multiple testing and controlling the false discovery rate from genome wide association study. Represents genes associated with a linked cytological region or genes ‘near’ an associated SNP. Example: Genome-wide association study identifies a locus at 7p15.2 associated with endometriosis.

Gene Set Contents: Gene identifier and binary threshold.

Type of Data: Genome Wide Association Study

Curation Guide

The Curation menu in GeneWeaver provides options for managing curation tasks and for searching and assigning publications.

[ 100px](images/Image001.png " 100px")

Managing Curation Tasks

[ 100px](images/Image002.png " 100px")

When you select “Manage Curation Tasks” from the page menu, you are presented with a page whose side bar lists all of the curation groups you belong to, separated into groups you administer and groups of which you are only a member. The main body of the page contains the list of curation tasks for the top group in the side bar. The curation tasks are a mix of publications and genesets that have been assigned to this group; tasks that have not yet been assigned to a curator appear at the top of the table.

[ 300px](images/Image004.png " 300px")

You can change the selected group in the main part of the page just by clicking on the group of interest in the side bar.

Immediately above the table, there are buttons which will allow you to filter the contents of the table to contain: All results, Assigned tasks, Unassigned tasks, tasks which are Ready for review and tasks which have been Reviewed. In this context Assigned and Unassigned are referring to curator assignment.

[ 300px](images/Image006.png " 300px")

The columns of the table are mostly self-explanatory, however it’s worth explaining PUB ASSIGNMENT and # GENESETS.

The PUB ASSIGNMENT column displays the associated PubMed ID for a geneset task when the geneset was created through a Publication Assignment. The link on the PubMed ID will take you to the publication assignments page.

The # GENESETS column indicates, for a publication, how many genesets are associated with it as part of this specific publication assignment. If this publication is also assigned to another curation group, genesets created as part of that publication assignment are not included in this number.

If you are an administrator of the curation group for which you are managing tasks, there should also be an Assign Curator button at the top right of the page. You are able to select one or more task rows in the table, at which point they should be highlighted yellow.

[ 200px](images/Image008.png " 200px")

One note about how row selection works: there are no Shift or Control operations for selecting multiple rows. Rows are selected one at a time and remain selected until you click on the row again, at which point the row is deselected. Also, selections do not persist when you move to the next page of results. This latter issue is something we intend to address in a future release. For the time being, it is recommended that you select the visible rows you would like to assign, assign them, and then move on to the next page of results.

Once you’ve chosen the tasks you want to assign (or reassign), you will select the Assign Curator button.

[ 300px](images/Image010.png " 300px")

You will then be presented with a modal dialog box, where you can select the individual you wish to curate the tasks, and include a note regarding the curation assignment.

Once a curator has been selected, click the Assign For Curation button. If you select Close instead no assignment will be made.

For your convenience, if you realize while on the Curation Task Management page that you want to assign a publication to this group, so that you can subsequently assign it to a curator, there is also an Add Publication button at the top of the page.

[ 300px](images/Image006.png " 300px")

This button will take you to the Search/Assign Publications page with only publication generators listed that were created for the curation group.

[ 300px](images/Image014.png " 300px")

Search/Assign Publications

[ 100px](images/Image003.png " 100px")

When selecting “Search/Assign Publications” from the page menu you’ll be presented with a page containing an “accordion” display, with the middle section opened by default. The assumption is that most times the user will be interested in generating a list of publications from which to make assignments.

[ 300px](images/Image017.png " 300px")

The section is broken into 3 parts:

  1. Single Publication Assignment
  2. Publication Generators
  3. Generated Publication Listing

Single Publication Assignment

If you select the + symbol next to Single Publication Assignment you will be presented with a simple search box. This would be used in the case where you have a specific PubMed ID that you know and want to assign for curation. You simply enter the PubMed ID and select the Find Publication button.

[ 300px](images/Image019.png " 300px")

Assuming you’ve entered a valid PubMed ID, the citation will be returned so that you can confirm that this is indeed your publication of interest.

[ 300px](images/Image021.png " 300px")

To assign the publication to a curation group to work on, just select the Assign To Curation Group button and you will be presented with the following modal dialog box displaying a drop down so you can select the curation group and a text box so that you can enter any curation notes you might have.

[ 300px](images/Image023.png " 300px")

Publication Generators

If you select the + symbol next to Publication Generators you will be presented with a table of generators that have been created for groups of which you are a member, and an Add Generator button.

[ 300px](images/Image017.png " 300px")

The columns of the table represent: the NAME that was assigned to the generator when it was created; the PUBMED SEARCH term that is used to search PubMed and bring back a list of publications; FOR GROUP, which is the curation group for which the generator was created; the date the generator was LAST RUN; and a series of ACTIONS that can be executed on a generator (discussed under Generator Actions).

If no generators have been created for any of the groups to which you belong, the first step is to click Add Generator. This will bring up a modal dialog box.

[ 300px](images/Image027.png " 300px")

You will be presented with three fields, all of which are mandatory in order to have the Save button enabled. Generator Name is a self-selected name to represent your generator. PubMed Query must be a valid PubMed search term. You can learn more about valid PubMed search terms using the following YouTube video (<https://www.youtube.com/watch?v=dncRQ1cobdc&feature=relmfu>). There is also a link to the PubMed search string builder (<https://www.ncbi.nlm.nih.gov/pubmed/advanced>) directly in the dialog box.

[ 300px](images/Image029.png " 300px")

Once created the generator becomes available in the table of generators.

Generator Actions

There are three actions available to be used with generators: Run, Edit and Delete.

We’ll discuss Run last as it’s most involved and leads to the next section.

Edit is fairly straightforward. It presents you with a modal dialog identical to the one you get when creating a new generator. You are able to update any of the name, search term or group association.

Delete will simply bring up a confirmation dialog box.

[ 300px](images/Image041.png " 300px")

Lastly, the Run option will cause the generator to run against PubMed, automatically collapse the Publication Generators accordion section and expand the Generated Publication Listing section, with the results of the generator displayed.

Generated Publication Listing

If you select the + symbol next to Generated Publication Listing you will be presented with a table of publications that have been pulled from PubMed as the result of the PubMed search term associated with a given generator. This section is populated by selecting the Run icon in the generator table.

[ 300px](images/Image047.png " 300px")

Publications that are pulled by a publication generator are not persisted in the GeneWeaver database until they are assigned to a curation group. Instead, the publications that are not already assigned to a group are pulled directly from PubMed at the time of generation. Some of these queries can result in a very large number of publications (hundreds of thousands), so we only display a slice of the publications at a time. We keep track of the total number that match the search term and allow you to page through the results, going back out to PubMed each time to pull in the next set.
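
To make the paging idea concrete, the sketch below builds one request per page against the NCBI E-utilities ESearch endpoint, using its `retstart` and `retmax` parameters to ask for a slice of the matches. Whether GeneWeaver itself uses this endpoint is an assumption; the helper names `esearch_page_url` and `summarize_page`, the page size of 20 and the sample reply are hypothetical. No network call is made here.

```python
import math
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_page_url(term, page, page_size=20):
    """Build the ESearch URL for one zero-based page of PubMed results."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": page * page_size,  # index of the first record to return
        "retmax": page_size,           # number of records in this slice
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def summarize_page(response, page_size=20):
    """Return (total matches, number of pages, PubMed IDs) from a parsed ESearch JSON reply."""
    result = response["esearchresult"]
    total = int(result["count"])
    return total, math.ceil(total / page_size), result["idlist"]


# Hand-written reply in the general shape ESearch returns; not real search results.
sample = {"esearchresult": {"count": "45", "idlist": ["111", "222"]}}
total, pages, pmids = summarize_page(sample)

print(esearch_page_url("alcohol preference", page=2))
print(total, pages, pmids)
```

Only the total count and the current slice of IDs need to be held at any time, which is what makes very large result sets manageable.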

Similar to the Curation Task Management page, you can select multiple rows to be assigned to a curation group all at once. This is done by individually selecting each publication of interest; there is no multi-select using the Control or Shift keys. The only way to deselect a row is to click the row again.

You can get more detail about a publication by clicking the + symbol at the beginning of the row. This will display the title, authors, journal and publication date, a link to the full text of the publication and the abstract.

[ 300px](images/Image049.png " 300px")

Once you’ve selected the publication or publications that you would like to assign to a curation group, you select the Assign to Curation Group button. This will bring up a modal dialog box where you will select a curation group, and optionally type in a note regarding the curation that is to be done.

[ 300px](images/Image051.png " 300px")

Once assigned, the publications should have a View icon at the end of the row; if you hover the cursor over the icon, you will see a tool tip telling you which group or groups are curating this publication.

[ 300px](images/Image053.png " 300px")

Also, if you select the + symbol at the beginning of the row now, the groups will be listed under Assigned to Curation Groups under the expanded details.

Once an assignment has been made, a notification will be sent to the administrator of the curation group so they know that there is a new publication that needs to be assigned to a curator. Notifications will be discussed in another section. If you now return to the Manage Curation Tasks page for the curation group to which the publication has been assigned, you should see the publication listed at the top of the tasks table.

[ 300px](images/Image055.png " 300px")

Publication Curation Assignment

You can get to the Publication Curation Assignment page from the Curation Task Management page in one of two ways.

[ 300px](images/Image057.png " 300px")

If you select a publication that has not been assigned to a curator yet, you’ll get to a page that looks something like this:

[ 300px](images/Image058.png " 300px")

The citation information is present, and the curation group is identified, but there is no curator assigned and no associated genesets.

Assignment to a curator could have been done via the Curation Task Management page as detailed previously, or by using the Assign To Curator button on this page. The functionality of that button is essentially the same as on the other page, with an option to select a curator, and include a curation note.

Once the curator is assigned, the curator’s name and any notes that have been entered will appear in the upper right hand side of the page.

[ 200px](images/Image060.png " 200px")

As the assignee of a publication, you will be presented with an additional button below Save Notes to be used to Create New Geneset. The Reassign button that was visible to the administrator now becomes a Mark as Complete button.

[ 200px](images/Image062.png " 200px")

Clicking on the Create New Geneset button brings up a dialog that allows you to enter a “stub” for one or more new genesets. A stub is essentially a placeholder for a geneset that will be more completely populated at a later time. This gives a curator the ability to quickly create several stubs while reviewing an article without having to enter the full information for each.

[ 300px](images/Image064.png " 300px")

The curator can select the species of interest and then enter the name, the label to be used in figures and a description. They can add multiple stubs for this species by selecting Add Row, and when they have entered the information for all the geneset stubs associated with this species, they select Submit.

When you’ve hit Submit, some automatic annotation of the geneset happens in the background. Your geneset stub will not immediately become visible under GeneSets Created For This Assignment. Instead you will see “loading…”. Once the geneset stubs are created the page will display the new geneset stubs.

[ 300px](images/Image066.png " 300px")

Once it’s loaded, the geneset stub will appear under GeneSets Created For This Assignment. It may take a while for the new geneset stub(s) to appear in the list of genesets associated with the publication assignment, since GeneWeaver is calling out to an external text annotator to annotate the geneset description and publication abstract.

If there are other genesets visible to the user that are associated with this publication, but were not created through this publication assignment, then they will show up under Other Visible GeneSets Associated With This Publication.

[ 300px](images/Image068.png " 300px")

Once the geneset stubs have been created, the curator can click on the link for any one of the genesets, and begin curation of an actual geneset.

When curation of all of the associated genesets for this publication is complete, the curator should click the Mark as Complete button on the Publication Curation Assignment page.

Curation Page

The geneset curation page is essentially the standard view geneset details page with some of the features turned off. On this page the curator can add or remove genes from the geneset, set a threshold, edit meta content, or update the curation notes. Once the curator has finished editing the geneset they can mark it as Ready for Review, which will send the geneset back to the group administrator for review. If the group has multiple administrators, the geneset will be sent to the administrator who assigned the curation task to the curator.

[ 300px](images/Curate_geneset.png " 300px")

Notifications

Notifications are the mechanism GeneWeaver uses to send messages within the application. There is also an option to receive email for notifications, which can be controlled from the Account Settings page.

[ 300px](images/Image072.png " 300px")

[ 300px](images/Image074.png " 300px")

Regardless of whether or not a user has been configured to receive emails, they will always receive messages through the Notifications page. The fact that you have pending notifications will be noted in the menu bar by a red indicator over the envelope icon.

[ 300px](images/Image076.png " 300px")

The Notifications page itself is fairly straightforward, listing the notifications that have not yet been seen in bold and the rest of the notifications in normal font. There is a button at the bottom of the page that allows you to Load More Notifications so that you can get your full history of notifications.

[ 300px](images/Image078.png " 300px")

Analysis Tools

GeneWeaver uses a set of analysis tools to operate on genes and gene sets. These tools evaluate a range of data inputs for the purposes of elucidating hierarchical relationships among a set of gene sets of interest. They can be used to visualize bipartite clusters (HiSim Graph) or to visualize genes with the more common intersections (GeneSet Graph).

Generation and visualization of a maximal triclique using the intersection of gene sets with the Triclique Viewer Tool can allow users to discover novel relationships between gene ontology terms. The overlap/similarity of the gene sets themselves can be visualized with Jaccard Similarity plots. These set overlaps are also available for Clustering, while component gene intersections can be found in our Gene Intersection Lists. The Boolean Algebra tool uses advanced set logic to integrate multiple genesets. For each tool, GeneWeaver allows users to expand their search beyond a single species using Homology Mapping.
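
For readers unfamiliar with the measure behind the Jaccard Similarity plots, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below is a generic illustration of that formula, not GeneWeaver's implementation; the `jaccard` function and the gene names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


set_a = {"GeneA", "GeneB", "GeneC"}
set_b = {"GeneA", "GeneB", "GeneD"}

# Two shared genes out of four distinct genes overall.
print(jaccard(set_a, set_b))  # 0.5
```

A value of 1.0 means the two gene sets are identical and 0.0 means they share no genes.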

HiSim Graph

About the HiSim Graph Tool

The HiSim Graph, short for Hierarchical Similarity Graph, is a tool for grouping functional genomic datasets based on the genes they contain. For example: The user may want to determine what a set of experiments on alcohol preference have in common, and what makes various experiments unique from one another. Alternatively, one may wish to take a large set of studies of related phenomena and identify their shared or distinct substrates. In this situation one may want to know whether there is a shared biological basis for addiction and learning, and if so, what the substrate is. The user might also want to examine studies of a large number of related disorders and determine whether a more appropriate biologically-based classification can be constructed.

The HiSim Graph Tool is designed to address these goals; it presents a tree of hierarchical relationships for a set of input GeneSets. The structure is determined solely from the gene overlaps of every combination of GeneSets.

Understanding the Results of the HiSim Graph

The HiSim Graph Tool is easiest to use with an understanding of set intersections. If GeneSet A contains Gene A, Gene B, and Gene C, and GeneSet B contains Gene A, Gene B, and Gene D, then the intersection of GeneSet A and GeneSet B contains Gene A and Gene B, because the intersection of a group of sets is whatever is contained in every one of those sets.
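The intersection described above can be sketched in a few lines of Python. The function name `all_intersections` and the dictionary layout are illustrative assumptions, not part of GeneWeaver. The sketch only lists the populated intersections for every combination of two or more GeneSets; it does not reproduce how the tool arranges them into a tree.

```python
from itertools import combinations


def all_intersections(gene_sets):
    """Map each combination of two or more GeneSets to the genes they all share."""
    shared = {}
    names = sorted(gene_sets)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            genes = set.intersection(*(set(gene_sets[name]) for name in combo))
            if genes:  # keep only populated intersections
                shared[combo] = genes
    return shared


# The example from the text: two GeneSets that share Gene A and Gene B.
gene_sets = {
    "GeneSet A": {"Gene A", "Gene B", "Gene C"},
    "GeneSet B": {"Gene A", "Gene B", "Gene D"},
}
print(all_intersections(gene_sets))
```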

In terms of GeneSets, the smallest intersections (fewest GeneSets, most genes) are towards the right, and the largest intersections (most GeneSets, fewest genes) are on the left. When thinking about the genes in all the GeneSets, the roles are reversed (smallest number of genes on the left, largest number of genes on the right).

Figure 1: Relation of GeneSets to the HiSim Graph

HiSim Graphs must be interpreted in the context of the input GeneSets. The above example represents differentially expressed genes in multiple brain regions of alcohol preferring rats from a single study. The highest intersection represents a gene differentially expressed in all 5 brain regions. In this case, the highest intersection represents the highest amount of correspondence between data sets. As you move to the right, genes become more specific to the brain regions tested. Each solid node has children and can be collapsed by clicking on it. Leaf nodes are empty and colored by species, which is identified in a legend at the bottom of the screen.

Figure 2: A HiSim Graph for diverse functions

If one were to start with multiple alcohol preference measures from different studies, the top of the HiSim Graph represents the correspondence between the experiments (such as well-characterized alcohol preference genes), and as you descend the graph the intersections describe more specific features shared between experiments (such as stress response or tissue source).

When starting with more loosely related inputs, interpretation becomes more difficult. If one started with alcohol preference, nicotine dependence, and traumatic brain injury data (Figure 2), the top of the HiSim Graph would represent more generic processes such as neural plasticity in this case.

Using the HiSim Graph Tool

Access the HiSim Graph Tool through the Analyze Genesets tab.

To generate a HiSim Graph, you must first select gene sets from a project. Projects may be created and updated by uploading GeneSets, searching the GeneWeaver database, or through the use of other tools in the GeneWeaver system. See the documentation for uploading GeneSets, Search, or Manage GeneSets to learn more about these functions. To select an entire project or multiple projects for analysis, check the box next to the project name. To select individual GeneSets within a project, click on the + beside the project name and check individual GeneSets using the check boxes. Next, click on the HiSim Graph icon in the Analysis tools box to the left of the project list. Select the options you would like for the tool to run on, and click Run.

Figure 3: Selecting gene sets and executing an analysis from the Analyze GeneSets page

Figure 4: The results page for the HiSim Graph. Most genes are connected to two of the input GeneSets. One gene is connected to three of the input sets. (Inset) The GeneSet Intersection page. GeneSet intersection data can be downloaded as a csv file for subsequent analyses. The GeneSets giving rise to each node can be stored in a separate project.

The HiSim Graph opens and the nodes can be selected to expand the graph. More details of each intersection can be viewed by clicking on the individual nodes in the tree. A link at the bottom of the frame allows download of the csv.

Figure 5: These options are available for the HiSim Graph, to change the way nodes interact with each other. The stats of the graph, as well as shortcuts and the legend identifying each species in the graph, are also displayed.

Figure 6. This shows the search function, which highlights paths between nodes containing the item searched for, whether it be gene, geneset, or species.

Options

There are a number of options available to optimize the HiSim Graph analyses. You may access the following options on the Analyze GeneSets page by clicking on the blue '+' symbol to the left of the HiSim Graph Tool.

DisableBootstrap

When the resulting HiSim Graph is very large, a bootstrapping filter is applied to reduce the output size. This step removes edges that are weakly supported by the underlying data, for example, those partitions of GeneSet subgroups that are driven by a single gene difference between the groups. If you would like the large, unfiltered graph instead, set this option to True to disable bootstrapping. Be warned that this may greatly increase the size of the graph.

Figure 6: A HiSim Graph with DisableBootstrap turned on (True).

Figure 7: A HiSim Graph with DisableBootstrap turned off (False).

Homology

Include homology to integrate multi-species data. This is done by using homologene mappings to relate identifiers across species. If homology is excluded, data from multiple species will be segregated into separate trees.

Figure 8: Homology excluded. A separate map is drawn for mouse, no overlap with human is allowed.

Figure 9: Homology included. GeneSets from mouse and human are allowed to be mixed and are intertwined as one.

MinGenes

The minimum number of genes for an intersection. The default of 1 means that all intersections will be displayed. Increasing the value means that intersections with fewer genes will not be displayed in the output, decreasing noise and displaying more robust correspondence between GeneSets. This generally has the effect of removing the topmost nodes.

Figure 10: As shown above, the left tree is with the default MinGenes = 1, the right tree is with MinGenes = 5.

Permutations

The HiSim Graph can ultimately address questions among highly curated data such as how much dimension reduction does gene overlap provide. For example, one may take a large set of gene sets associated with mood disorders and ask whether the data are similar enough to group together, i.e., of all possible subset intersections, how many are populated, and is this result better than chance?

The maximum number of permutations to run is set to 0 by default since it can take a long time to run for large input sets. The genes contained in each GeneSet are permuted over the union of all genes in the input sets, controlling for the size of each GeneSet. The permutation tests measure the likelihood of getting a similar tree structure (Parsimony) or of getting a similar aggregation of genes in each intersection (Gene Aggregation). Note that this is a maximum value since the actual number of permutations may be lower due to the time limit.
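As a rough illustration of "permuted over the union of all genes, controlling for the size of each GeneSet", the sketch below redraws every GeneSet as a random sample of the same size from the union of the inputs. The function name `permute_gene_sets` and the sampling scheme are assumptions made for illustration; the tool's actual permutation procedure may differ.

```python
import random


def permute_gene_sets(gene_sets, rng):
    """Redraw each GeneSet at its original size from the union of all input genes."""
    universe = sorted(set().union(*gene_sets.values()))
    return {
        name: set(rng.sample(universe, len(genes)))
        for name, genes in gene_sets.items()
    }


observed = {
    "Set 1": {"g1", "g2", "g3"},
    "Set 2": {"g2", "g4"},
    "Set 3": {"g5"},
}
# A seeded generator keeps the illustration repeatable.
permuted = permute_gene_sets(observed, random.Random(0))
print(permuted)
```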

Parsimony is a simple measure of the percentage of observed intersections out of all possible intersections. This is mathematically defined as:

Figure 11: For those that aren't aware of the mathematical implications of parsimony, think of it as one of the many measures of accuracy for a map. You want more parsimony, but you can't always get full parsimony.

Gene Aggregation is a measure of the total node/tree probability. Each node is scored based on the intersection of genes and gene sets. Then the product of these scores is used to assign an overall tree aggregation probability:

Figure 12: Aggregation is another measure of accuracy that balances with parsimony in this tool, neither are ever fully accurate alone, but together they are more fine-tuned.

PermutationTimeLimit

The maximum amount of time to spend doing permutations. For example, if Permutations is set to 100,000 and this value is 5 minutes, the result will either have 100,000 permutations (if they finished within 5 minutes), or will be truncated to the number of permutations which were able to finish within 5 minutes. The more time you give to PermutationTimeLimit, the more permutations can complete and the more accurate your results will be.
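The interaction between Permutations and PermutationTimeLimit can be pictured as a loop with two stopping conditions. This is a minimal sketch under the assumption that the limit is checked before each permutation; `run_permutations` and `permute_once` are hypothetical names, not GeneWeaver functions.

```python
import time


def run_permutations(permute_once, max_permutations, time_limit_seconds):
    """Run up to max_permutations, stopping early once the time limit is reached."""
    deadline = time.monotonic() + time_limit_seconds
    results = []
    for _ in range(max_permutations):
        if time.monotonic() >= deadline:
            break  # truncated: report only the permutations that finished
        results.append(permute_once())
    return results


# With a generous limit every requested permutation completes.
completed = run_permutations(lambda: 1, 100, 60)
print(len(completed))
```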


GeneSet Graph

Why Use the GeneSet Graph Tool

The GeneSet Graph is designed for users who need a partitioned display that illustrates how closely genes are tied to one another. For example, it suits a user who wants a visual reference rather than a chemical reference or a reference by utility. A GeneSet Graph can also help pick out the most valuable or most frequently occurring genes, depending on the user's preference.

Understanding the GeneSet Graph Tool

The GeneSet Graph Tool presents a partitioned display of genes and GeneSets. Genes are represented by elliptical nodes, and GeneSets are represented by boxes. The least-connected genes are displayed on the left, followed by the GeneSets, then the more-connected genes in increasing order to the right. Genes and GeneSets are connected by colored lines to show what genes are in which GeneSets. In this way, the GeneSet Graph displays the bipartite graph of the genes and GeneSets, but modifies the display of the gene partition to make it easier to visually interpret.

Figure 1: Least connected genes to the left, GeneSets in the middle, most connected genes on the right.

Using the GeneSet Graph Tool

Access the GeneSet Graph Tool through the My Projects tab under the Analyze Genesets option.

To generate a GeneSet Graph, you must first select GeneSets from a project. Projects may be created and updated by uploading GeneSets, searching the GeneWeaver database, or through the use of other tools in the GeneWeaver system. See the documentation for uploading GeneSets, Search, or Manage GeneSets to learn more about these functions. From the Analyze GeneSets tab, select “My Projects”. To select an entire project or multiple projects for analysis, check the box next to the project name. To select individual GeneSets within a project, click on the ‘!’ beside the project name and check individual GeneSets using the checkboxes. Next, click on the GeneSet Graph icon in the Analysis tools box to the left of the project list. To change options, press the green + sign before starting the tool.

Figure 2: The GeneSet Graph can be interactively panned and zoomed with the mouse, and more details of each gene or GeneSet can be viewed by clicking on the individual nodes in the display. In addition to these interactive features, there are also a few options available to optimize the display.

Figure 3: Clicking on a gene node executes a search for other GeneSets containing the gene of interest or its homologues. Clicking on a GeneSet node reveals full publication and annotation information, including the GeneSet description.

Options

SuppressDisconnected

When enabled, this option will suppress the display of GeneSets which are not connected to any displayed genes. This helps remove unnecessary information for users who only want to see relations. This is only relevant when MinDegree is greater than 1.

Homology

Include homology to integrate multi-species data. If excluded, data from multiple species will be segregated into distinctly separate graphs.

Figure 4: 2 GeneSets each from mouse and rat.

MinDegree

The minimum number of connections for a displayed gene. A value of 2 means that any displayed genes must be found in at least two of the input gene sets. Increasing this value will shift the resulting gene display left. Since lower-order overlaps are generally more likely and more numerous than higher-order intersections, this can quickly reduce the number of genes displayed and make the result more manageable.

Figure 5
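The effect of MinDegree and SuppressDisconnected on the bipartite graph can be sketched as a simple filter. The function name `filter_graph` and its arguments are assumptions made for this illustration and are not the tool's code.

```python
from collections import Counter


def filter_graph(gene_sets, min_degree, suppress_disconnected):
    """Keep genes found in at least min_degree GeneSets; optionally drop empty GeneSets."""
    degree = Counter(gene for genes in gene_sets.values() for gene in set(genes))
    kept_genes = {gene for gene, count in degree.items() if count >= min_degree}
    edges = {name: set(genes) & kept_genes for name, genes in gene_sets.items()}
    if suppress_disconnected:
        # Hide GeneSets that no longer connect to any displayed gene.
        edges = {name: genes for name, genes in edges.items() if genes}
    return edges


example = {
    "S1": {"a", "b", "c"},
    "S2": {"b", "c", "d"},
    "S3": {"c", "e"},
    "S4": {"f"},
}
print(filter_graph(example, min_degree=2, suppress_disconnected=True))
```

In this example S4 shares no gene with any other set, so it disappears once MinDegree is 2 and SuppressDisconnected is enabled.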


Jaccard Similarity

Why Use the Jaccard Similarity Tool

The Jaccard Similarity Tool displays a matrix of Venn diagrams, which can be very useful for quickly finding overlapping GeneSets and evaluating the similarity of results across a collection of experiments. This snapshot may enable you to determine which GeneSets can be removed or kept for more complex comparison analysis (such as the HiSim Graph).

Understanding the Jaccard Similarity Tool

Each Venn Diagram represents the pairwise gene overlap between the two GeneSets depicted for each row and column. Text overlays show the exact gene counts, Jaccard Similarity coefficient and p-value for every pair. The p-value is calculated based on the cumulative probability of obtaining a Jaccard coefficient greater than or equal to the observed value, using formula [17] in Real and Vargas, 1996.

For those less familiar with Jaccard Similarity, it is the ratio of the number of elements found in both sets (the intersection) to the number of elements found in either set (the union). If your matrix produces two separate blue and red circles, rather than an overlapping Venn Diagram, it means the two GeneSets have no genes in common.

Jaccard Similarity Equation - source
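As a small worked example of the coefficient, the sketch below computes it for two GeneSets that share two of four distinct genes. The function name `jaccard` is illustrative, and returning 0.0 for two empty sets is a convention chosen here, not something the documentation specifies.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


geneset_a = {"Gene A", "Gene B", "Gene C"}
geneset_b = {"Gene A", "Gene B", "Gene D"}
print(jaccard(geneset_a, geneset_b))  # 2 shared genes / 4 distinct genes = 0.5
```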


Background Processes

The Jaccard Similarity Tool now calculates the p-value for the Jaccard Similarity score from an empirical sampling distribution. The distribution is approximated for each unique pair of gene set cardinalities (gene set sizes). For each unique pair of cardinalities, gene sets are randomly sampled (10,000 samples) from the actual gene list of the GeneWeaver database, and the resulting Jaccard Similarity scores are tallied by frequency. The result is a Frequency versus Jaccard Similarity histogram that is used as the distribution for the calculation of the p-value. To calculate the p-value, the tool compares the Jaccard Similarity of the user-selected gene sets against the curve stored in the database.

If the Jaccard Similarity does not exist in the curve - that is, if the Similarity is too high to occur randomly - the p-value is simply zero. A Jaccard Similarity of 1 indicates that the two gene sets are identical. In this case, we assign a special p-value of 1*, since a set will always match itself (as opposed to some other set which contains other genes).
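The idea of an empirical p-value can be sketched as follows: draw random set pairs of the same two sizes, and report the fraction whose Jaccard score is at least as large as the observed one. This is only an illustration. The names `empirical_p_value` and `universe` are assumptions, the sampling is done on the fly rather than read from a stored histogram, and the special 1* case is not handled.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def empirical_p_value(observed, size_a, size_b, universe, samples, rng):
    """Fraction of random set pairs whose Jaccard score is >= the observed score."""
    at_least = 0
    for _ in range(samples):
        a = set(rng.sample(universe, size_a))
        b = set(rng.sample(universe, size_b))
        if jaccard(a, b) >= observed:
            at_least += 1
    return at_least / samples


# Hypothetical gene universe of 1,000 identifiers.
universe = [f"gene{i}" for i in range(1000)]
p = empirical_p_value(0.25, 5, 5, universe, 200, random.Random(1))
print(p)
```

A score that never occurs among the random pairs yields a p-value of zero, which matches the behavior described above for similarities that are too high to occur randomly.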

This process is implemented in C++ and runs in the background as your results are loading onto the next page.

Using the Jaccard Similarity Tool

Access the Jaccard Similarity Tool through the My Projects tab under the Analyze Genesets option.

To generate a Jaccard Similarity Matrix, you must first select gene sets from a project. Projects may be created and updated by uploading Gene Sets, searching the GeneWeaver database, or through the use of other tools in the GeneWeaver system. See the documentation for uploading GeneSets, Search, or Manage GeneSets to learn more about these functions. From the Analyze GeneSets tab, select My Projects. To select an entire project or multiple projects for analysis, check the box next to the project name. To select individual GeneSets within a project, click on the + beside the project name and check individual gene sets using the check boxes. Next, click on the Jaccard Similarity icon in the Analysis tools box to the left of the project list.

Figure 1: Once you have selected GeneSets from a project, select the Jaccard Similarity icon from the Analysis Tools box, to the left of your GeneSets.

Tool results are displayed as a grid of proportional overlaps. The grid itself is written in d3 for dynamic user interaction.

Figure 3: Venn diagram for 9 GeneSets. The detail below highlights Column 3, Row 2.

Jaccard Overlap
GS row = pink circle (left)
GS column = green circle (right)
J = Jaccard coefficient
p = p-value
Green circles show emphasis genes

The resulting matrix can be zoomed in and out by scrolling the mouse up and down. There is a reset zoom button in case the user loses their place in the matrix of Venn diagrams. In addition to these interactive features, the gene sets can be highlighted by row and column by shift+clicking on the intersection of two gene sets.

Figure 6: Highlight of row 2, column 3


The gene sets can be deselected by alt+clicking on any highlighted gene set.

Rerun Option

The user is able to rerun the tool with different parameters with the rerun tool options.

Figure 7: Rerun tool option

This option can be expanded or collapsed by clicking on the Rerun Tool Options text.

Geneset Panel

The geneset panel shows the Jaccard coefficients and the p-values for every geneset pair in the project the user has chosen. The geneset panel does not receive the same reduction as the Venn diagram, since it is still helpful to view every geneset pairing.

The user may also click the checkboxes located next to the geneset names to add the selected genesets to a project or to export the genes.

Figure 2: Click Run to produce Jaccard Similarity Results for your selected GeneSets. Text overlays show the exact gene counts, Jaccard Similarity coefficient and p-value for every pair.

Options

Homology

Include homology in order to integrate multi-species data. If excluded, homologous genes from different species will not be counted as intersecting. Data from separate species will never show an overlap without homology.

PairwiseDeletion

Pairwise Deletion excludes missing values only from the comparisons that involve them, so that the remaining values can still be used for comparison:

Values    Obj1    Obj2    Obj3
Length    23      N/A     13
Width     21      22      14
Depth     N/A     20      11

Figure 7: In Pairwise Deletion, when comparing length, only Obj1 and Obj3 will be compared. When comparing width, all will be compared, and when comparing depth, only Obj2 and Obj3 can be looked at. This prevents missing data from being assigned a default value such as 0 in the system.
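The table above can be mirrored in a short sketch in which missing values are stored as `None` and skipped one comparison at a time. The names `measurements` and `comparable_objects` are illustrative assumptions.

```python
# Missing measurements are stored as None instead of a default such as 0.
measurements = {
    "Length": {"Obj1": 23, "Obj2": None, "Obj3": 13},
    "Width": {"Obj1": 21, "Obj2": 22, "Obj3": 14},
    "Depth": {"Obj1": None, "Obj2": 20, "Obj3": 11},
}


def comparable_objects(row):
    """Return the objects that have a value for this measurement."""
    return [name for name, value in row.items() if value is not None]


for measure, row in measurements.items():
    print(measure, comparable_objects(row))
```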


Clustering

Motivation

Clustering is one of the most powerful tools in bioinformatics. Where fixed classifications are too strict to distinguish the data, clustering gives the user a more graded evaluation.

User Guide

Using the Tool

  1. Select the gene sets from your list of projects that you would like to analyze.
  2. Select if homology is to be included or excluded.
  3. Select the method of clustering.

Understanding your Results

Visualization Types

There are two methods for visualizing your clustering results.

Force Directed Graph

Partitioned Sunburst

Clustering Methods

Listed below are the six different methods that the user can choose from while running the tool. The first five are different clustering methods that will run on the selected genesets and display a force directed tree and a partitioned sunburst based on the clustered genesets.

All five of the given clustering methods are agglomerative hierarchical clustering methods that start with each geneset belonging to its own cluster. They then combine the clusters at each iteration based off of a described linkage method that determines how the distance between two clusters is defined. The clusters are combined until there are no more clusters that are similar to each other (the distance between them is too large).

McQuitty

The McQuitty clustering method uses a linkage method where distance depends on the combination of clusters instead of the individual genesets within each cluster. When two clusters are joined together, the distance of the new cluster to any other cluster is calculated as the average distance between the two clusters that are being joined and the other cluster. For example, if clusters 2 and 4 have the greatest similarity and we are going to combine them into a new cluster called 2+4, then the distance from 2+4 to 1 is the average of the distances from 2 to 1 and 4 to 1.
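The McQuitty update can be written as a small function over a table of pairwise distances. This is a sketch under assumed names (`merge_mcquitty`, a dictionary keyed by `frozenset` pairs) and made-up distances; it shows only the averaging rule described above.

```python
def merge_mcquitty(distances, a, b):
    """Merge clusters a and b; the new distance to each other cluster is the
    average of that cluster's distances to a and to b."""
    merged = f"{a}+{b}"
    labels = {label for pair in distances for label in pair}
    others = labels - {a, b}
    # Distances that involve neither a nor b are unchanged.
    updated = {
        pair: value
        for pair, value in distances.items()
        if a not in pair and b not in pair
    }
    for other in others:
        d_a = distances[frozenset({a, other})]
        d_b = distances[frozenset({b, other})]
        updated[frozenset({merged, other})] = (d_a + d_b) / 2
    return updated


# Hypothetical distances between clusters 1, 2 and 4.
distances = {
    frozenset({"1", "2"}): 0.6,
    frozenset({"1", "4"}): 0.8,
    frozenset({"2", "4"}): 0.2,
}
print(merge_mcquitty(distances, "2", "4"))
```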

Ward

The Ward clustering method uses a linkage method where the distance between two clusters is based off of the Jaccard Similarity score between them. When two clusters are joined together, the new cluster will take the union of the genesets in the two clusters that are being joined and set that as its geneset. It will then calculate the new geneset's similarity score against all the other cluster's genesets and that will be set as the distance between the new cluster and all the other clusters.

Complete

The Complete clustering method uses a linkage method where the distance between two clusters is the lowest similarity score between any of the genesets in one cluster compared to any of the genesets in the other cluster. When two clusters are combined, the genesets within each of the clusters are put into a new cluster. No new calculations are needed at each iteration because we are simply reusing the similarity scores of all the genesets compared to each other.

Average

The Average clustering method uses a linkage method where the distance between two clusters is the average similarity score between all of the genesets in one cluster compared to all of the genesets in the other cluster. When two clusters are combined, the genesets within each of the clusters are put into a new cluster. No new calculations are needed at each iteration because we are simply reusing the similarity scores of all the genesets compared to each other.

Single

The Single clustering method uses a linkage method where the distance between two clusters is the highest similarity score between any of the genesets in one cluster compared to any of the genesets in the other cluster. When two clusters are combined, the genesets within each of the clusters are put into a new cluster. No new calculations are needed at each iteration because we are simply reusing the similarity scores of all the genesets compared to each other.
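The Complete, Average and Single descriptions above differ only in how the pairwise similarity scores between two clusters are summarized. The sketch below makes that explicit, using Jaccard similarity as the score. The function names and the example genesets are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(cluster_a, cluster_b, gene_sets, method):
    """Summarize all geneset-to-geneset scores between two clusters."""
    scores = [
        jaccard(gene_sets[x], gene_sets[y])
        for x in cluster_a
        for y in cluster_b
    ]
    if method == "single":    # highest similarity between any pair
        return max(scores)
    if method == "complete":  # lowest similarity between any pair
        return min(scores)
    if method == "average":   # mean similarity over all pairs
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method}")


gene_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2"},
    "C": {"g1", "g5"},
}
for method in ("single", "complete", "average"):
    print(method, cluster_similarity(["A", "B"], ["C"], gene_sets, method))
```

Here the scores between the two clusters are 1/5 (A versus C) and 1/3 (B versus C), so Single reports 1/3, Complete reports 1/5, and Average reports their mean.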


DBSCAN Gene Clustering

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups genes into clusters based on how closely related the genes are.

Why Use the DBSCAN Tool?

In general, clustering is used to find patterns or outliers within data sets. In this implementation of DBSCAN, genes in the same cluster would be considered similar, while genes in different clusters would be less similar. An explanation of DBSCAN can be found here. Within GeneWeaver, this tool can be used to infer relationships between genes. For example, if clusters with similar genes continue to appear in tests across multiple data sets, one could say that these genes are closely related.

DBSCAN Parameters

DBSCAN takes in 2 parameters, epsilon and minPoints.

The Epsilon Parameter

Epsilon determines how close the genes need to be in order to be considered in the same cluster. For example, an epsilon of 1 means that genes need to share at least 1 gene set. Another way of describing epsilon would be the "radius of the neighborhood". A larger epsilon will have a farther reach when finding clusters.

The minPoints Parameter

The minPoints parameter determines the minimum number of points required to form a cluster. A cluster can have more than minPoints genes, but cannot have fewer. If a group has fewer than minPoints genes, it is considered noise.

The DBSCAN Algorithm

Before the DBSCAN algorithm executes, it must determine how closely related each gene is to the other genes. A bipartite graph is used to show how the genes connect to each gene set. First, all closest paths between genes are found. Following that, the DBSCAN algorithm is run. You can find an example of DBSCAN here.

Run Times of DBSCAN

The worst-case time complexity of DBSCAN is O(n²). However, due to the sheer variability of data sets and epsilon and minPoints combinations, it is difficult to accurately predict the run time of this implementation. There are some factors that will typically increase the run time. These include:

Note: Even if no clusters are found, the algorithm may still take time to execute.

Below is a graph that shows the run times of the algorithm. The red line shows the run time if all genes are in the same gene set. The blue line shows the genes divided into 10 gene sets, with no overlap. The green line is similar to the blue line, but here the gene sets share one gene in common with one other gene set. This results in one giant cluster with all of the genes.

Note: Since the blue line and green line overlap, you may not be able to see the blue line.

Below is a table that estimates the run time of the red, blue, and green cases based on number of genes. Note that run times will change based on density of the gene sets and epsilon.

Number of Genes    1 Gene Set    10 Gene Sets, No Overlap    10 Gene Sets, Overlap
100                3             3                           3
200                3             3                           3
500                5             3                           3
1,000              10            3                           3
1,500              12            3                           3
2,000              15            3                           3
2,500              28            5                           5
3,000              63            8                           8
3,500              110           12                          12
4,000              160           17                          18
4,500              230           24                          25
5,000              306           32                          33
6,000              487           50                          51
7,000              708           72                          75
8,000              969           98                          100
9,000              1270          129                         131
10,000             1612          163                         165

Approximate DBSCAN Run Times with Epsilon = 1 and Min Points = 1 (in seconds)

Visualization

Once DBSCAN is completed, results can be visualized in two ways. However, there is a possibility that visualization may not occur. If a data set is too large, the results will not be visualized and a message will be displayed.

Note: Due to the rendering of the Cluster / Gene Table, run times may appear longer than estimated here.

Circles

The default visualization on the tool is circle packing. This represents the clusters and the genes within them. The outermost circle is the entire data set. The darker blue circles within represent the different clusters. The circles within the clusters represent the genes that belong to the cluster. The color of each gene denotes the species.

To see more information about the cluster, you can click on the cluster. This will zoom in on the cluster and display gene IDs. Clicking on a gene ID will redirect to a search for that gene within the GeneWeaver database.

Below is an example of the circle packing visualization with zoom functionality.

Wires

The other visualization is a wire representation. This shows the connections between all genes in the same gene set. The color of each gene shows which cluster the gene is in. If a gene is grey, it is considered noise. Mousing over a circle will highlight it and show the gene ID. By clicking and holding a gene, you can drag the gene around the screen.

Note: This visualization will only be drawn with small data sets due to the complexity of drawing all lines between genes.

Below is an example of the wires visualization.

Cluster / Gene Table

Below the visualizations is a table. This table is split up into clusters, each of which lists all the genes within that specific cluster. Information about each gene can be seen here as well. This table is similar to the one on the GeneSet Details page.

If the data set becomes sufficiently large, a minimized table will be shown on screen. An example of the minimized table is below.

DBSCAN Example

Below is an example of the DBSCAN algorithm. For this example, epsilon is set to 1 and min-points is set to 4. Figure 1 shows the gene-to-gene set bipartite graph.

Figure 1: The gene-to-gene set bipartite graph

Finding Shortest Paths Between Genes

Starting at "Test Set 0", Prp31, Arr1, baz, and car are all in the same gene set. This means that when building the gene-to-gene graph, all of those genes will be connected to each other. "Test Set 1" shows that Arr1 and veli are connected. "Test Set 2" has veli and Arr2 connected. "Test Set 3" has Arr2 connected to CalX. Finally, "Test Set 4" has CalX, CdsA, and Cerk connected. Now that the connections between genes are determined, a map can be drawn showing these connections (Figure 2).

Figure 2: The gene-to-gene graph denoting shortest paths

Using this graph, the shortest path from a gene to any other gene can be determined. For example, the distance between Arr1 and baz is 1. The distance between Prp31 and CalX is 4. This is important when applying epsilon to the algorithm.
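The two distances quoted above can be reproduced with a breadth-first search over the gene-to-gene graph. The sketch below is an illustration only; `build_gene_graph` and `shortest_distances` are hypothetical names, and the gene sets are the five test sets from this example.

```python
from collections import deque
from itertools import combinations

TEST_SETS = {
    "Test Set 0": ["Prp31", "Arr1", "baz", "car"],
    "Test Set 1": ["Arr1", "veli"],
    "Test Set 2": ["veli", "Arr2"],
    "Test Set 3": ["Arr2", "CalX"],
    "Test Set 4": ["CalX", "CdsA", "Cerk"],
}


def build_gene_graph(gene_sets):
    """Connect every pair of genes that appear together in a gene set."""
    graph = {}
    for genes in gene_sets.values():
        for gene in genes:
            graph.setdefault(gene, set())
        for first, second in combinations(genes, 2):
            graph[first].add(second)
            graph[second].add(first)
    return graph


def shortest_distances(graph, start):
    """Breadth-first search: number of steps from start to every reachable gene."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


graph = build_gene_graph(TEST_SETS)
print(shortest_distances(graph, "Arr1")["baz"])    # 1
print(shortest_distances(graph, "Prp31")["CalX"])  # 4
```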

Running the DBSCAN Algorithm

On the right is the pseudocode for the algorithm.

DBSCAN Pseudocode


Starting in the DBSCAN function, the cluster is first initialized to 0. Next, each point is visited only once. For this example, baz will be the first gene visited. baz will first be marked as visited, then the neighbors of baz will be found by regionQuery. The regionQuery function will return all points within radius epsilon, including the point itself. Calling regionQuery on baz with epsilon will return all genes that are at most one step away from baz. In this example baz, car, Prp31, and Arr1 are returned and listed as baz's neighbors.

The list [baz, car, Prp31, Arr1] is returned. Now the number of items in the list is checked against the minPoints parameter. If it is greater than or equal to minPoints, a cluster is formed. Otherwise, the point is labelled as noise. In this example, baz has 4 neighbors, which is equal to minPoints. The "C = next cluster" statement means that C is a valid cluster. Next, the expandCluster function is called.

The expandCluster function will continue to expand the cluster until the edge of the cluster is reached. The edge of a cluster is reached when a point has a list of neighbors that is smaller than minPoints. When entering the expandCluster function, the point P will be added to the cluster. The cluster is currently [baz]. Next, the algorithm runs through all of the neighbors to see if the cluster can be expanded. The list of neighbor points is now [baz, car, Prp31, Arr1]. First baz is looked at, but because it has already been visited, it is not going to be checked again. Next, car is checked. regionQuery on car returns a list of all its neighbors, which are [car, baz, Prp31, Arr1]. Then the size of that list is checked against minPoints. Since it is greater than or equal to minPoints, that list is added to the original list of neighbors. So the original neighbors list of [baz, car, Prp31, Arr1] and the new neighbors list of [car, baz, Prp31, Arr1] are added together. However, the algorithm does not add duplicate genes to the list. Therefore, nothing is added to the list and the neighbors list is [baz, car, Prp31, Arr1]. Then, the gene is added to the current cluster if it is not already part of a cluster. car is not a part of any other cluster so it is added to the current cluster. Now the cluster contains [baz, car].

Next, Prp31 is looked at. Its neighbors are [baz, car, Prp31, Arr1]. The size of this list is equal to minPoints, but once again, all of Prp31's neighbors are already in the list of baz's neighbors. So nothing new is added to the neighbors list, and since Prp31 is not a part of any other cluster, it is added to the current cluster, which is now [baz, car, Prp31].

Now, Arr1 is looked at. Its neighbors are [Arr1, baz, car, Prp31, veli]. Notice that a new gene appeared in Arr1's neighbors (veli). This gene is now added to the list of baz's neighbors. Arr1 is added to the current cluster, so the cluster now holds [baz, car, Prp31, Arr1]. Now there is still one gene left to check in baz's neighbors, which is veli.

veli is checked and its neighbors are [veli, Arr1, Arr2]. The list is smaller than minPoints, which means the cluster cannot be expanded past veli.

However, veli is still part of the current cluster. The current cluster is now [baz, car, Prp31, Arr1, veli]. Since all of baz's neighbors have been checked, the cluster is finished.

Now that baz has been checked, it is time to check other genes. Next, car is checked. However, it was already visited when handling baz's neighbors, so nothing needs to be checked. The same applies for Prp31, Arr1, and veli. The next gene to check is Arr2. Arr2's neighbors are [veli, Arr2, CalX]. This is less than minPoints, so it is marked as noise.

However, just because a gene is marked as noise does not guarantee that it is still noise when the algorithm is finished. Later in the algorithm, it can be added to a cluster.

Next, CalX is checked. Its neighbors are [CalX, Arr2, CdsA, Cerk]. The size of this list is equal to minPoints, so a new cluster is formed and needs to be expanded.

CalX is checked first. It has already been visited, but it is not a part of any cluster, so it is added to the 2nd cluster. The 2nd cluster currently holds [CalX]. Next, Arr2 is checked. It was already visited and marked as noise. However, it is not in any cluster, so it is added to the 2nd cluster. The 2nd cluster now contains [CalX, Arr2]. Next, CdsA is checked. Its neighbors are [CdsA, Cerk, CalX]. This list is smaller than minPoints, so nothing is added to CalX's neighbors. CdsA is not part of any other cluster, so it is added to the 2nd cluster. The 2nd cluster is now [CalX, Arr2, CdsA]. Finally, Cerk is checked. Its neighbors are [Cerk, CdsA, CalX]. The list is smaller than minPoints, so they are not added to CalX's neighbors. Cerk is not a part of any cluster, so it is added to the 2nd cluster. The 2nd cluster is now complete. It contains [CalX, Arr2, CdsA, Cerk].

Now that CalX is checked, CdsA is checked. It was already visited in the expandCluster function so nothing needs to be done. The same applies for Cerk. The algorithm is now complete.

Two clusters were produced: [baz, car, Prp31, Arr1, veli] and [Arr2, CalX, CdsA, Cerk].

Figure 3 shows the gene-to-gene map visualized in clusters.

Figure 3: The result of the DBSCAN clustering
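The walkthrough can also be traced in code. The following is a minimal Python sketch, not GeneWeaver's own implementation. It assumes that minPoints is 4 (inferred from the CalX step), that a gene's neighbor list includes the gene itself, and that the neighbor lists for baz, car and Prp31 are as written below; those three lists, the Cerk entry with Cerk added, and the names dbscan, NEIGHBORS and MIN_POINTS are illustrative assumptions.

```python
# Neighbor lists taken from the walkthrough where stated.
# The entries for baz, car and Prp31 are assumed for illustration.
NEIGHBORS = {
    "baz":   ["baz", "car", "Prp31", "Arr1"],   # assumed
    "car":   ["car", "baz", "Arr1"],            # assumed
    "Prp31": ["Prp31", "baz", "Arr1"],          # assumed
    "Arr1":  ["Arr1", "baz", "car", "Prp31", "veli"],
    "veli":  ["veli", "Arr1", "Arr2"],
    "Arr2":  ["veli", "Arr2", "CalX"],
    "CalX":  ["CalX", "Arr2", "CdsA", "Cerk"],
    "CdsA":  ["CdsA", "Cerk", "CalX"],
    "Cerk":  ["Cerk", "CdsA", "CalX"],          # Cerk itself added (assumed)
}
MIN_POINTS = 4  # inferred from the CalX step


def dbscan(neighbors, min_points):
    """Cluster genes from precomputed neighbor lists, visiting genes in dict order."""
    visited = set()
    assigned = {}   # gene -> index of the cluster it belongs to
    clusters = []
    noise = set()   # provisional: a noise gene may join a cluster later

    for gene in neighbors:
        if gene in visited:
            continue
        visited.add(gene)
        if len(neighbors[gene]) < min_points:
            noise.add(gene)
            continue

        # Start a new cluster and expand it through the neighbor queue.
        cluster = [gene]
        assigned[gene] = len(clusters)
        clusters.append(cluster)
        queue = list(neighbors[gene])
        position = 0
        while position < len(queue):
            other = queue[position]
            position += 1
            if other not in visited:
                visited.add(other)
                # Only genes with enough neighbors extend the queue.
                if len(neighbors[other]) >= min_points:
                    for candidate in neighbors[other]:
                        if candidate not in queue:
                            queue.append(candidate)
            if other not in assigned:
                assigned[other] = assigned[gene]
                cluster.append(other)
                noise.discard(other)
    return clusters, noise


clusters, noise = dbscan(NEIGHBORS, MIN_POINTS)
for index, cluster in enumerate(clusters, start=1):
    print("Cluster {}: {}".format(index, cluster))
```

Under these assumptions the sketch yields the same two clusters as the walkthrough, and Arr2 moves from noise into the 2nd cluster once CalX is expanded.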


MSET

About MSET

MSET (Modular Single-Set Enrichment Test) is an enrichment test that tests a given geneset for enrichment against a particular collection of genes (known as the background) and identifies potential genes for use in future studies. Enrichment testing with MSET involves randomly sampling with replacement from the background to create a collection of simulated "top genes" obtained by chance. Using this collection of simulations, MSET then determines the p-value of the given geneset and identifies potentially interesting genes.

Why MSET?

MSET permits the selection, or customization, of the genes against which enrichment is performed. This yields the ability to perform more focused hypothesis testing relative to other enrichment tests. For example, genes specific to Alzheimer's may be selected to serve as the genes of interest against which enrichment testing is performed.

How Does MSET Work?

MSET performs enrichment testing using three items: the background, the top genes, and the genes of interest.

MSET then takes the following steps:

  1. MSET calculates the intersect size, or the number of genes shared, between the top genes and the genes of interest.
  2. MSET samples randomly with replacement from the background to generate a simulation of top genes x times. This generates x simulations.
  3. The intersect size between each simulation and the genes of interest is calculated and the number of simulations with an intersect size greater than or equal to that between the top genes and genes of interest is counted.
  4. The p-value of the top genes is calculated by dividing the count from the previous step by x (the total number of simulations generated).

An Example

The example below illustrates the process of MSET when generating 10 simulations.

Given the following:

MSET first calculates the intersect size of the top genes with the genes of interest.

Figure 1: The intersection of the top genes with the genes of interest.

Since the top genes and the genes of interest share the gene j, the intersect size is determined to be 1.

MSET then samples randomly with replacement from the background to generate 10 top gene simulations.

Figure 2: MSET samples from the background to produce 10 sample top genes.

Simulated top genes:

  1. [b, d]
  2. [j, j]
  3. [k, a]
  4. [c, e]
  5. [g, g]
  6. [a, i]
  7. [b, g]
  8. [f, g]
  9. [c, d]
  10. [k, k]

From the simulated top genes above, MSET counts the simulations whose intersect size with the genes of interest is at least that of the top genes with the genes of interest.

Figure 3: The intersections of each sample with the genes of interest.

Points to note:

Since #2 shares gene j with the genes of interest and gene j occurs two times in the simulation, it has an intersect size of 2. Additionally, simulations that share no genes with the genes of interest have an intersect size of 0 and are not counted in MSET’s calculation of the p-value.

Once MSET is finished with its calculation, it uses the results of the calculation to determine the p-value of the top genes using the following equation:

           # of simulations with intersect size ≥ intersect size of top genes  
p-value = ------------------------------------------------------------------
                        # of simulations generated

where intersect size refers to the size of intersection with the genes of interest.
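The procedure can be sketched in Python as follows. This is an illustrative sketch rather than GeneWeaver's implementation: the function names, the seed parameter, and the choice to draw each simulation with the same number of genes as the top genes are assumptions. A gene that is drawn more than once is counted once per occurrence, matching the treatment of simulation #2.

```python
import random


def intersect_size(genes, genes_of_interest):
    """Count the genes that are genes of interest; repeated draws count each time."""
    return sum(1 for gene in genes if gene in genes_of_interest)


def p_value_from_simulations(simulations, genes_of_interest, observed_size):
    """Fraction of simulations whose intersect size is at least the observed one."""
    hits = sum(
        1 for simulation in simulations
        if intersect_size(simulation, genes_of_interest) >= observed_size
    )
    return hits / len(simulations)


def mset_p_value(top_genes, genes_of_interest, background, n_simulations, seed=None):
    """Estimate the p-value by sampling from the background with replacement."""
    rng = random.Random(seed)
    interest = set(genes_of_interest)
    observed_size = intersect_size(top_genes, interest)
    simulations = [
        # Assumption: each simulation is the same size as the top genes.
        rng.choices(background, k=len(top_genes))
        for _ in range(n_simulations)
    ]
    return p_value_from_simulations(simulations, interest, observed_size)


# Hypothetical inputs, chosen only to show the call.
background = list("abcdefghijk")
print(mset_p_value(["a", "j"], {"j", "k"}, background, n_simulations=1000, seed=0))
```

Passing a seed makes a run repeatable; without one, the estimate varies slightly from run to run because the simulations are random.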

Using MSET

Access the MSET Tool through the Analyze Genesets tab.

To analyze your genes, select two projects: one containing the genes to be analyzed, and the other containing the genes of interest. Projects may be created and updated by uploading GeneSets, searching the GeneWeaver database, or through the use of other tools in the GeneWeaver system. See the documentation for uploading GeneSets, Search, or Manage GeneSets to learn more about these functions. To select an entire project or multiple projects for analysis, check the box next to the project name.

Figure 4: The Analyze Genesets page.

Next, click on the MSET icon in the Analysis tools box to the left of the project list and specify which project will serve as the Top Genes and which project will serve as the Interest Genes (genes of interest). Select the options you would like the tool to run on. Click Run to begin analysis.

Note: The background option specifies the attribution or gene database type, while the species option designates the species to be pulled from the GeneWeaver database. Together, these two options select the background that MSET samples from, chosen from a number of previously generated backgrounds that cover various combinations of the two.

Figure 5: Selecting the projects and options for MSET to run on.

Once the tool has completed analysis, you will be directed to the results page where you may view the probability distribution graph of all simulations generated, the size comparison graph of the genes of interest vs. the background, as well as any interesting genes that the tool has detected.

Figure 6: Viewing the MSET Results page.

You may also rerun the tool using the Tool Options section located below the listing of interesting genes.

Figure 7: The menu for rerunning the tool on the results page.


API

Geneweaver API Request Formats

The Geneweaver API can be accessed through the following address: https://geneweaver.org/api/

Definitions

Term: Definition
Output: The output of every call to the Geneweaver API will be in JSON format. An example of JSON can be viewed here: http://json.org/example.
API Key: The Geneweaver API makes use of api keys to identify users and determine permissions they have when executing api calls. For example, to determine if a user has permission to view a private gene set they must identify themselves via their unique api key. In place of an api key, the “guest” key may be used instead; however, this will limit the user to public data only. A user may request an API key by creating an account on Geneweaver and asking for an API key on the account management page.
<apiKey>: A unique identifier for a user (see API Key)
<ReferenceID>: A string representing the gene ID
<GeneDatabase>: A string with the Database Name corresponding to the gene ID
homology: Optional addition that will return homologous genes
<GeneSetID>: A positive integer value representing a gene set ID
<GeneID>: A positive integer value representing a gene ID
<ProjectID>: A positive integer value representing a project ID
<PlatformID>: A positive integer value representing a platform ID
<PublicationID>: A positive integer value representing a publication ID
<SpeciesID>: A positive integer value representing a species ID
<DatabaseID>: A positive integer value representing a gene database ID
<Project_Name>: A string representing the name of an existing or new project
<TaskID>: A unique identifier for a task returned by a tool
<FileType>: The file type you wish to get (see specific tool for available file types)

Data Calls

This section outlines the individual calls that are available from the Geneweaver API.

Get Gene Sets by Gene Reference ID: This call returns all gene sets that contain the specified gene. The added homology parameter will return all gene sets that contain homologous genes as well.

/api/get/geneset/bygeneid/<apiKey>/<ReferenceID>/<GeneDatabase>/homology

Sample Call: https://geneweaver.org/api/get/geneset/bygeneid/Fw7J4GeAXE8CMVvLTKyrtBDk/RGD2561/RGD/homology

Get Gene Set by Gene Set ID: This call returns all information about a specified gene set given that gene set ID.

/api/get/geneset/byid/<GeneSetID>/

Sample Call: https://geneweaver.org/api/get/geneset/byid/220592/

Get Gene Set by User: This call returns all gene sets owned by the specified user

/api/get/geneset/byuser/<apikey>/

Sample Call: https://geneweaver.org/api/get/geneset/byuser/Fw7J4GeAXE8CMVvLTKyrtBDk/

Get Genes by Gene Set ID: This call returns all genes belonging to a given gene set.

/api/get/genes/bygenesetid/<GeneSetID>/

Sample Call: https://geneweaver.org/api/get/genes/bygenesetid/220592/

Get Gene by Gene ID: This call returns all information about a specified gene given an ODE gene ID.

/api/get/gene/bygeneid/<GeneID>/

Sample Call: https://geneweaver.org/api/get/gene/bygeneid/8/

Get Geneset by Project ID: This call returns all genesets associated with a project given a project ID.

/api/get/geneset/byprojectid/<apikey>/<ProjectID>/

Sample Call: https://geneweaver.org/api/get/geneset/byprojectid/Fw7J4GeAXE8CMVvLTKyrtBDk/2404/

Get Geneset by Geneset ID: This call returns all the information about a given geneset given its geneset ID

/api/get/geneset/bygenesetid/<GeneSetID>/

Sample Call: https://geneweaver.org/api/get/geneset/bygenesetid/8/

Get Projects by User: Returns all the projects that are owned by a given user.

/api/get/project/byuser/<apikey>/

Sample Call: https://geneweaver.org/api/get/project/byuser/Fw7J4GeAXE8CMVvLTKyrtBDk/

Get Ontologies by Geneset ID: Returns all the Ontology annotations associated with a geneset.

/api/get/ontologies/bygeneset/<apikey>/<GeneSetID>/

Sample Call: https://geneweaver.org/api/get/ontologies/bygeneset/Fw7J4GeAXE8CMVvLTKyrtBDk/8/

Get Probes by Gene ID: Returns all the probes associated with a gene.

/api/get/probes/bygeneid/<apikey>/<ReferenceID>/

Sample Call: https://geneweaver.org/api/get/probes/bygeneid/Fw7J4GeAXE8CMVvLTKyrtBDk/RGD2561/

Get Platform by Platform ID: Returns the platform associated with a platform ID.

/api/get/platform/byid/<apikey>/<PlatformID>/

Sample Call: https://geneweaver.org/api/get/platform/byid/Fw7J4GeAXE8CMVvLTKyrtBDk/3/

Get SNP by Gene ID: Returns all the SNPs associated with a gene (provided SNPs are loaded in the GW DB).

/api/get/snp/bygeneid/<apikey>/<ReferenceID>/

Sample Call: https://geneweaver.org/api/get/snp/bygeneid/Fw7J4GeAXE8CMVvLTKyrtBDk/RGD2561/

Get Publication by Publication ID: Returns all the publication data for given publication ID.

/api/get/publication/byid/<apikey>/<PublicationID>/

Sample Call: https://geneweaver.org/api/get/publication/byid/Fw7J4GeAXE8CMVvLTKyrtBDk/26/

Get Species by Species ID: Returns all the species information given a species ID.

/api/get/species/byid/<apikey>/<SpeciesID>/

Sample Call: https://geneweaver.org/api/get/species/byid/Fw7J4GeAXE8CMVvLTKyrtBDk/4/

Get Gene Database by Database ID: Returns information on a gene database given a database ID.

/api/get/genedatabase/byid/<apikey>/<DatabaseID>/

Sample Call: https://geneweaver.org/api/get/genedatabase/byid/Fw7J4GeAXE8CMVvLTKyrtBDk/7/

Create Project: Creates a project for the user and returns the project id that was just created.

/api/add/project/byuser/<apikey>/<Project_Name>/

Sample Call: https://geneweaver.org/api/add/project/byuser/Fw7J4GeAXE8CMVvLTKyrtBDk/myNewProject/

Add GeneSet To Project: Adds an existing gene set to a project you own

/api/add/geneset/toproject/<apikey>/<ProjectID>/<GeneSetID>/

Sample Call: https://geneweaver.org/api/add/geneset/toproject/Fw7J4GeAXE8CMVvLTKyrtBDk/3323/86676/

Remove GeneSet From Project: Removes a gene set from a project you own.

/api/delete/geneset/fromproject/<apikey>/<ProjectID>/<GeneSetID>/

Sample Call: https://geneweaver.org/api/delete/geneset/fromproject/Fw7J4GeAXE8CMVvLTKyrtBDk/3323/86676/
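All of the data calls above share one pattern: slash-separated path segments under /api/. As a convenience, a small helper can assemble these paths and percent-encode each segment, so that values such as project names containing spaces are safe to send. The sketch below is written for Python 3; api_path, api_url and API_ROOT are hypothetical names and are not part of the Geneweaver API, and YOUR_API_KEY is a placeholder.

```python
from urllib.parse import quote

API_ROOT = "https://geneweaver.org"


def api_path(*segments):
    """Join segments into an API path such as /api/get/geneset/byid/220592/."""
    encoded = [quote(str(segment), safe="") for segment in segments]
    return "/api/" + "/".join(encoded) + "/"


def api_url(*segments):
    """Return the full URL for the given path segments."""
    return API_ROOT + api_path(*segments)


print(api_url("get", "geneset", "byid", 220592))
print(api_path("add", "project", "byuser", "YOUR_API_KEY", "Nicotine Studies"))
```

Encoding each segment separately keeps a slash inside a value from being read as a path separator.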

Tool Output Calls

This section covers calling the GeneWeaver tools via the API. Each tool has its own API URL; calling that URL starts the tool and returns a task ID. The getStatus API call can then be made with that task ID to determine whether the tool has finished processing your request. Once the tool has finished, the resulting data may be retrieved via the getFile API call using the same task ID.

For ALL tools, any of the parameters may be substituted with Default to use the default values.

Get Status of Tool Job: This api call will return the status of a job given its unique task ID.

/api/tool/get/status/<TaskID>/

This will return one of the following:

Sample Call: https://geneweaver.org/api/tool/get/status/c0bdc0e4-3e23-4273-aeeb-21539e60c53d/
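A client will usually call the status URL repeatedly until the job is done. The following Python 3 sketch shows one way to do that. It is an assumption-laden illustration: wait_for_task, fetch_json and their parameters are hypothetical names, and because the exact status values are not reproduced here, the caller supplies the test for "finished" as a function.

```python
import json
import time
import urllib.request

BASE_URL = "https://geneweaver.org"


def fetch_json(path):
    """GET an API path and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_task(task_id, is_finished, fetch=fetch_json, interval=5.0, max_polls=60):
    """Poll the status call until is_finished(status) is true.

    is_finished is supplied by the caller because the status values
    depend on what the status call actually returns.
    """
    path = "/api/tool/get/status/{}/".format(task_id)
    for _ in range(max_polls):
        status = fetch(path)
        if is_finished(status):
            return status
        time.sleep(interval)
    raise TimeoutError(
        "task {} did not finish after {} polls".format(task_id, max_polls)
    )
```

A call might look like wait_for_task(task_id, lambda status: status == expected_value), where expected_value is whichever status the API reports for a completed job. Capping the number of polls keeps a failed job from blocking the client indefinitely.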

Get Results Link: This api call will return a url that can be called to access a file requested by the user if the user has permission to access that file. This is useful if you wish to store a link for quicker repeat access to a file.

/api/tool/get/link/<apikey>/<TaskID>/<FileType>/

Sample Call: https://geneweaver.org/api/tool/get/link/Fw7J4GeAXE8CMVvLTKyrtBDk/c0bdc0e4-3e23-4273-aeeb-21539e60c53d/pdf/

Get Results File: This api call will return the file requested by the user if the user has permission to access that file.

/api/tool/get/file/<apikey>/<TaskID>/<FileType>/

Sample Call: https://geneweaver.org/api/tool/get/file/Fw7J4GeAXE8CMVvLTKyrtBDk/c0bdc0e4-3e23-4273-aeeb-21539e60c53d/pdf/

Get Results by User: Returns the task IDs of all the tool runs started by the user.

/api/get/results/byuser/<apikey>/

Sample Call: https://geneweaver.org/api/get/results/byuser/Fw7J4GeAXE8CMVvLTKyrtBDk/

Get Results by Task ID: Returns all the information about a given tool run given a task ID

/api/get/result/bytaskid/<apikey>/<TaskID>/

Sample Call: https://geneweaver.org/api/get/result/bytaskid/Fw7J4GeAXE8CMVvLTKyrtBDk/c0bdc0e4-3e23-4273-aeeb-21539e60c53d/

Run Tool Calls

GeneSet Viewer: This tool visualizes the gene-geneset graph. This tool requires at least 2 genesets.

/api/tool/genesetviewer/<apikey>/<homology>/<supressDisconnected>/<minDegree>/<genesets>/

/api/tool/genesetviewer/byprojects/<apikey>/<homology>/<supressDisconnected>/<minDegree>/<projects>/

Variables:

Expected Returns: [“pdf”, “dot”, “svg”]

Sample Calls:

https://geneweaver.org/api/tool/genesetviewer/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/On/Auto/391:394:395/

https://geneweaver.org/api/tool/genesetviewer/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/On/Auto/3323:2404/

Jaccard Clustering: This tool displays the Jaccard Distance (a measure of dissimilarity) and is used to cluster genesets. This tool requires at least 3 genesets.

/api/tool/jaccardclustering/<apikey>/<homology>/<method>/<genesets>/

/api/tool/jaccardclustering/byprojects/<apikey>/<homology>/<method>/<projects>/

Variables:

Expected Returns: [“pdf”, “png”, “jac”]

Sample Calls:

https://geneweaver.org/api/tool/jaccardclustering/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/Ward/391:394:395/

https://geneweaver.org/api/tool/jaccardclustering/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/Ward/3323:2404/
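For reference, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that coefficient. The Python sketch below uses the standard definitions with made-up gene symbols; the function names and the convention chosen for two empty sets are assumptions, and the tool itself may handle homology and edge cases differently.

```python
def jaccard_coefficient(first, second):
    """Similarity: |intersection| / |union| of the two gene collections."""
    a, b = set(first), set(second)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_distance(first, second):
    """Dissimilarity: one minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(first, second)


# Made-up gene symbols: two shared genes out of four distinct genes.
set_one = {"geneA", "geneB", "geneC"}
set_two = {"geneB", "geneC", "geneD"}
print(jaccard_coefficient(set_one, set_two))
print(jaccard_distance(set_one, set_two))
```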

Jaccard Similarity: This tool computes the Jaccard coefficient, a measure of similarity, for multiple genesets. This tool requires at least 2 genesets.

/api/tool/jaccardsimilarity/<apikey>/<homology>/<pairwiseDeletion>/<genesets>/

/api/tool/jaccardsimilarity/byprojects/<apikey>/<homology>/<pairwiseDeletion>/<projects>/

Variables:

Expected Returns: [“svg”, “png”, “txt”*]

*The txt file follows this format: rows are separated by newlines and columns by tabs. The first cell of the first row is a 0, followed by tab-separated geneset names. Every following row begins with a geneset name, forming the matrix. Each remaining cell holds the “jaccardValue:pValue” for the two genesets named by its row and column.

Sample Calls:

http://geneweaver.org/api/tool/jaccardsimilarity/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/Disabled/391:394:395/

http://geneweaver.org/api/tool/jaccardsimilarity/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/Disabled/3323:2404/
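The txt matrix described above is straightforward to read back into a program. The following Python sketch parses it into a dictionary keyed by pairs of geneset names. It is based only on the format description given for the txt output; the function name parse_jaccard_txt and the sample text are made up, and cells that are not in jaccardValue:pValue form are skipped rather than guessed at.

```python
def parse_jaccard_txt(text):
    """Return {(row_name, column_name): (jaccard_value, p_value)}."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    column_names = rows[0][1:]  # the first cell of the header row is the 0
    result = {}
    for row in rows[1:]:
        row_name, cells = row[0], row[1:]
        for column_name, cell in zip(column_names, cells):
            jaccard_value, separator, p_value = cell.partition(":")
            if not separator:
                continue  # not in jaccardValue:pValue form
            result[(row_name, column_name)] = (float(jaccard_value), float(p_value))
    return result


# Made-up sample in the described layout.
sample = "0\tGS1\tGS2\nGS1\t1.0:0.0\t0.25:0.01\nGS2\t0.25:0.01\t1.0:0.0\n"
matrix = parse_jaccard_txt(sample)
print(matrix[("GS1", "GS2")])
```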

Combine: This tool creates a geneset-gene matrix of the combined genesets. This tool requires at least 2 genesets.

/api/tool/combine/<apikey>/<homology>/<genesets>/

/api/tool/combine/byprojects/<apikey>/<homology>/<projects>/

Variables:

Expected Returns: [“odemat”]

Sample Calls:

https://geneweaver.org/api/tool/combine/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/391:394:395/

https://geneweaver.org/api/tool/combine/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/3323:2404/

Phenome Map: This tool uses biclique-based analysis to generate hierarchical maps of gene set interactions.

/api/tool/phenomemap/<apikey>/<homology>/<minGenes>/<permutationTimeLimit>/<maxInNode>/<permutations>/
<disableBootstrap>/<minOverlap>/<nodeCutoff>/<geneIsNode>/<useFDR>/<hideUnEmphasized>/<p_Value>/
<maxLevel>/<genesets>/

/api/tool/phenomemap/byprojects/<apikey>/<homology>/<minGenes>/<permutationTimeLimit>/<maxInNode>/<permutations>/
<disableBootstrap>/<minOverlap>/<nodeCutoff>/<geneIsNode>/<useFDR>/<hideUnEmphasized>/<p_Value>/
<maxLevel>/<projects>/

Variables:

Expected Returns: [“dot”, “el.profile”, “el”, “graphml”, “odemat”, “svg”]

Sample Calls:

https://geneweaver.org/api/tool/phenomemap/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/1/5/4/100000/False/0%/Auto/All/False/False/1.0/0/391:394:395/

https://geneweaver.org/api/tool/phenomemap/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/1/5/4/100000/False/0%/Auto/All/False/False/1.0/0/3323:2404/

Boolean Algebra: This tool searches for genes across genesets.

/api/tool/booleanalgebra/<apikey>/<relation>/<genesets>/

/api/tool/booleanalgebra/byprojects/<apikey>/<relation>/<projects>/

Variables:

Expected Returns: [“txt”]

This file has four sections of raw data separated by newlines. The first section has the method used (Union or Intersect At Least 2). The second section has the resulting genes’ names. The third has the result genes’ ids. The fourth is a 2d array print out of the genesets used to run the tool with all their genes’ ids.

Sample Calls:

https://geneweaver.org/api/tool/combine/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/391:394:395/

https://geneweaver.org/api/tool/combine/byprojects/Fw7J4GeAXE8CMVvLTKyrtBDk/Included/3323:2404/

Wrapping RESTful API Code

There are numerous ways to wrap function URLs to ensure that return values are processed. Below is an example of a python method, adapted from http://stackoverflow.com/questions/17301938/making-a-request-to-a-restful-api-using-python

#Python 2.7.6
#RestfulClient.py

import requests
import json

# Replace with the correct URL
url = "http://api_url"

# retrieve API URL
myResponse = requests.get(url)
#print (myResponse.status_code)

# For successful API call, response code will be 200 (OK)
if(myResponse.ok):

    # Loading the response data into a dict variable
    # json.loads takes in only binary or string variables so using content to fetch binary content
    # Loads (Load String) takes a Json file and converts into python data structure (dict or list, depending on JSON)
    jData = json.loads(myResponse.content)

    print("The response contains {0} properties".format(len(jData)))
    print("\n")
    for key in jData:
        # str() guards against values that are not strings (numbers, lists, nested objects)
        print(key + " : " + str(jData[key]))
else:
    # If response code is not ok (200), print the resulting http error code with description
    myResponse.raise_for_status()

Example Script

Below is an example of a Python script that makes various Geneweaver API calls. The script will:

  1. Print the information about an example gene set
  2. Print the information about the genes in the example gene set
  3. Create a new project called "Nicotine Studies"
  4. Add the example gene set to the new project
  5. Print the information about all the gene sets owned by the user
  6. Print the information about all the projects owned by the user
  7. Run the GeneSet Viewer tool on the 10 example gene sets
  8. Print the status of the tool job
  9. Print a link to the result of the tool job
  10. Run the Jaccard Clustering tool on the 10 example gene sets
  11. Run the Combine tool on two example projects
  12. Print all of the tasks that the user ran.

# Python 2.7.13
# tutorial-api.py

import httplib
import json
import urllib
import time

# Replace with the correct API key
apikey = "Fw7J4GeAXE8CMVvLTKyrtBDk"

# Prepare the connection to Geneweaver
host = "geneweaver.org"
method = "GET"
connection = httplib.HTTPConnection(host)

# This function takes a GeneWeaver API URL and loads the result in a Python object.
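def retrieveApiUrl(url):
    url = urllib.quote(url)
    connection.request(method, url)
    response = connection.getresponse()
    # Always read the body, even on failure, so the shared connection can be reused.
    data = response.read()
    is_successful = response.status == 200 and response.reason == "OK"
    jData = json.loads(data) if is_successful else None
    return jData

# This function posts a JSON form to a GeneWeaver API URL and loads the result in a Python object.
# It is used by the GeneSet upload sections at the end of this script.
# Assumption: the upload endpoint accepts the JSON-encoded form as the POST body;
# adjust the headers if the server expects a different encoding.
def postFormAndRetrieveApiUrl(url, formData):
    url = urllib.quote(url)
    headers = {"Content-Type": "application/json"}
    connection.request("POST", url, formData, headers)
    response = connection.getresponse()
    data = response.read()
    is_successful = response.status == 200 and response.reason == "OK"
    jData = json.loads(data) if is_successful else None
    return jData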

# This function waits 10 seconds so that a tool has enough time to run.
def waitForToolToFinish():
    print("Waiting 10 seconds for the task to complete...")
    time.sleep(10)
    print("10 seconds has elapsed, resuming...")
    print("")

"""
Get Gene Set by Gene Set ID:
"""

# Replace with the desired parameters.
GeneSetID = "14888"

# Call the API
url = "/api/get/geneset/byid/{}/".format(GeneSetID)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    for key in jData[0][0]:
        print(key + " : " + str(jData[0][0][key]))
    print("")

"""
Get Genes by Gene Set ID:
"""

# Replace with the desired parameters.
GeneSetID = "14888"

# Call the API
url = "/api/get/genes/bygenesetid/{}/".format(GeneSetID)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    for gene in jData:
        for key in gene[0]:
            print(key + " : " + str(gene[0][key]))
        print("")

"""
Create Project:
"""

# Replace with the desired parameters.
Project_Name = "Nicotine Studies"

# Call the API
url = "/api/add/project/byuser/{}/{}/".format(apikey, Project_Name)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("ProjectID = " + str(jData[0]))
    print("")
    
    # Save the ProjectID for later
    ProjectID = jData[0]

"""
Add GeneSet To Project:
"""

# Replace with the desired parameters.
GeneSetID = "14888"

# Call the API
url = "/api/add/geneset/toproject/{}/{}/{}/".format(apikey, ProjectID, GeneSetID)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("ProjectID = " + str(jData[0][0]))
    print("GeneSetID = " + str(jData[0][1]))
    print("")

# Add 9 more gene sets
for GeneSetID in ["14889", "14890", "14891", "14892", "14887", "14893", "14885", "86761", "86791"]:
    # Call the API
    url = "/api/add/geneset/toproject/{}/{}/{}/".format(apikey, ProjectID, GeneSetID)
    jData = retrieveApiUrl(url)
    
    # Print the results if successful
    if jData is not None:
        print("ProjectID = " + str(jData[0][0]))
        print("GeneSetID = " + str(jData[0][1]))
        print("")

"""
Get Gene Set by User:
"""

# Call the API
url = "/api/get/geneset/byuser/{}/".format(apikey)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    for gene_set in jData:
        for key in gene_set[0]:
            print(key + " : " + str(gene_set[0][key]))
        print("")

"""
Get Projects by User:
"""

# Call the API
url = "/api/get/project/byuser/{}/".format(apikey)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    for gene_set in jData:
        for key in gene_set[0]:
            print(key + " : " + str(gene_set[0][key]))
        print("")

"""
GeneSet Viewer:
"""

# Replace with the desired parameters.
homology = "Included" # ["Default", "Included","Excluded"]
supressDisconnected = "On" # ["Default", "On","Off"] 
minDegree = "Auto" # ["Default", "Auto", "1","2","3","4","5","10","20"]
genesets = "14888:14889:14890:14891:14892:14887:14893:14885:86761:86791"
FileType = "pdf" # GeneSet Viewer can get ["pdf", "dot", "svg"]

# Call the API
url = "/api/tool/genesetviewer/{}/{}/{}/{}/{}/".format(apikey, homology, supressDisconnected, minDegree, genesets)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("TaskID = " + jData)
    print("")
    
    # Save the TaskID for later
    TaskID = jData

# Wait for the task to complete.
waitForToolToFinish()

"""
Get Status of Tool Job:
"""

# Call the API
url = "/api/tool/get/status/{}/".format(TaskID)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("Status = " + jData)
    print("")

"""
Get Results Link:
"""

# Replace with the desired parameters.
FileType = "pdf" # See the specific API to check which FileTypes are available.

# Call the API
url = "/api/tool/get/link/{}/{}/{}/".format(apikey, TaskID, FileType)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    result_link = "http://{}{}".format(host, jData)
    
    # Follow this link to get the result file.
    print("File is located at: " + result_link)
    print("")

"""
Jaccard Clustering:
Note: This section combines creating the task and getting the link to the result
"""

# Replace with the desired parameters.
homology = "Included" # ["Default", "Included","Excluded"]
jc_method = "Ward" # ["Default", "Ward", "Single", "Centroid", "McQuitty", "Average", "Complete", "Median"]
genesets = "14888:14889:14890:14891:14892:14887:14893:14885:86761:86791"
FileType = "jac" # Jaccard Clustering can get ["pdf", "png", "jac"]

# Call the API
url = "/api/tool/jaccardclustering/{}/{}/{}/{}/".format(apikey, homology, jc_method, genesets)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("TaskID = " + jData)
    print("")
    
    # Save the TaskID for later
    TaskID = jData

# Wait for the task to complete.
waitForToolToFinish()

# Check to see if the task really has completed successfully.
url = "/api/tool/get/status/{}/".format(TaskID)
jData = retrieveApiUrl(url)
if jData is not None:
    print("Status = " + jData)
    print("")

# Call the API
url = "/api/tool/get/link/{}/{}/{}/".format(apikey, TaskID, FileType)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    result_link = "http://{}{}".format(host, jData)
    
    # Follow this link to get the result file.
    print("File is located at: " + result_link)
    print("")

"""
Combine:
"""

# Replace with the desired parameters.
homology = "Included" # ["Default", "Included","Excluded"]
projects = "3323:2404"
FileType = "odemat" # Combine can get ["odemat"]

# Call the API
url = "/api/tool/combine/byprojects/{}/{}/{}/".format(apikey, homology, projects)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    print("TaskID = " + jData)
    print("")
    
    # Save the TaskID for later
    TaskID = jData

# Wait for the task to complete.
waitForToolToFinish()

# Check to see if the task really has completed successfully.
url = "/api/tool/get/status/{}/".format(TaskID)
jData = retrieveApiUrl(url)
if jData is not None:
    print("Status = " + jData)
    print("")

# Call the API
url = "/api/tool/get/link/{}/{}/{}/".format(apikey, TaskID, FileType)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    result_link = "http://{}{}".format(host, jData)
    
    # Follow this link to get the result file.
    print("File is located at: " + result_link)
    print("")

"""
Get Results by User:
"""

# Call the API
url = "/api/get/results/byuser/{}/".format(apikey)
jData = retrieveApiUrl(url)

# Print the results if successful
if jData is not None:
    for task in jData:
        for key in task[0]:
            print(key + " : " + str(task[0][key]))
        print("")

"""
GeneSet Upload:
Note: This section assumes that the file "tutorial_example_data.txt" exists in the current directory
"""

url = "/api/add/geneset/byuser/{}/".format(apikey)

# Replace with your actual file
file_path = "tutorial_example_data.txt"

jData = None
with open(file_path, 'r') as file:
    file_text = file.read()
    
    formData = json.dumps({ "gs_name": "Test 1",
        "gs_abbreviation": "rat heroin-seeking",
        "gs_description": "Test Description",
        "gs_threshold_type": "3",
        "permissions": "private",
        "pub_pubmed": "19664213",
        "sp_id": "3",
        "gene_identifier": "gene_7",
        "file_text": file_text })
    
    jData = postFormAndRetrieveApiUrl(url, formData)

# Print the results if successful
if jData is not None:
    print(jData)

"""
GeneSet URL Upload:
"""

url = "/api/add/geneset/byuser/{}/".format(apikey)

# Replace with your actual file url
file_url = "http://geneweaver.org/docs/tutorial_example_data.txt"

formData = json.dumps({ "gs_name": "Test 2",
    "gs_abbreviation": "rat heroin-seeking",
    "gs_description": "Test Description",
    "gs_threshold_type": "3",
    "permissions": "private",
    "pub_pubmed": "19664213",
    "sp_id": "3",
    "gene_identifier": "gene_7",
    "file_url": file_url })

jData = postFormAndRetrieveApiUrl(url, formData)

# Print the results if successful
if jData is not None:
    print(jData)            


FAQ

FREQUENTLY ASKED QUESTIONS

Q: What is GeneWeaver? What happened to “The Ontological Discovery Environment”?

Q: How is GeneWeaver different from gene set enrichment or ontology over-representation tools?

Q: How do I add my own gene sets to GeneWeaver?

Q: I got great results, but how do I make a high resolution image for my presentation?

Q: How do I add Open Biological Ontology annotation to my gene set?

Q: How do I change the abbreviation, name etc. for my gene set?

Q: I set my threshold too high/low. How do I change it?

Q: I uploaded a file with 200 genes, but it says that my gene set is empty?

Q: A public gene set is improperly labeled. How do I report this?

Q: How are homologous genes identified?

Q: My gene sets are listed as 'deprecated'. What does this mean?

Q: How should I cite GeneWeaver in my research?

Q: What do all the acronyms on the site stand for?

ANSWERS

Q. What is GeneWeaver? What happened to “The Ontological Discovery Environment”?

A. The Ontological Discovery Environment was conceived of as a tool for the integration of biological functions based on the molecular processes that subserved them. From these data, an empirically derived ontology may one day be inferred. Sounds like a mouthful? We think so, too. Moreover, our acronym, ODE, sounds like “ordinary differential equations”, “open development environment”, “Ohio Department of Education”, and the airport in Odense, Denmark. Our users have found the system valuable for a wide range of applications in the arena of functional genomic data integration. While the underlying algorithms of The Ontological Discovery Environment can be extended to many contexts, we chose to rename the system “GeneWeaver” to reflect the emphasis on genes and genomes, allowing our users to weave together the many complex relations among processes, pathways and functions implicit in functional genomics experiments.

Q. How is GeneWeaver different from gene set enrichment or ontology over-representation tools?

A. There are many statistical tools for the analysis of gene set overrepresentation, and it is indeed possible to perform similar analyses using some of the functions in GeneWeaver. However, GeneWeaver's primary focus and strength is in using gene sets to organize biological functions. GeneWeaver enables highly flexible set-set comparisons of both user submitted and curated gene sets. The suite of combinatorial tools enables large collections of user submitted gene sets to be compared to each other, and the hierarchical similarity tools enable classification and organization of gene sets based on the genes they contain. This allows discovery of hidden relations among common biological processes, even if those processes have been studied using highly diverse species, analytic methods and approaches. The GeneWeaver tools provide facile data integration and harmonization, and enable user directed integration of new and published results. Major incorporated data from other resources provide a wealth of other sources of contextual information which facilitate interpretation of these discoveries.

Q: How do I add my own gene sets to GeneWeaver?

A: There is a step-by-step guide available in the Wizard.

Q: How do I add Open Biological Ontology annotation to my gene set?

A: Browse to your GeneSet and click the "edit" link. Scroll to the bottom of the page and use the Tree Browser to select entries for your GeneSet. To change the OBO source, use the drop-down box at the top of the tree display. To remove any extraneous entries, use the small red 'x' on the left side. After you save the changes, your new information will be displayed and your GeneSet will be searchable using any of the ontologies selected.

Q: How do I change the abbreviation, name etc. for my gene set?

A: Browse to your GeneSet and click the "edit" link. Then simply change the values and save your changes. The new text will be displayed immediately.

Q: I set my threshold too high/low. How do I change it?

A: Browse to your GeneSet and click the "edit" link. Then simply fix the thresholds and save the changes. The new thresholds will be applied immediately.

Q: I've got great results, but how do I make a high resolution image for my presentation?

A: Each tool has a link to export the result as a PDF. Save the file, open it in Adobe Acrobat, Inkscape or other software, and save it as a PNG. This PNG file can be easily inserted into MS PowerPoint presentations or Word documents.

Q: I uploaded a file with 200 genes, but it says that my gene set is empty?

A: If there was no error reported, you probably set your threshold too high or too low; see the question above on changing thresholds. If there was an error, your data probably uses a different microarray or gene ID type than what was specified on the upload page.
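
The effect of a threshold on gene set size can be illustrated with a small sketch. This is not GeneWeaver's implementation: the function name `apply_threshold`, the example gene scores, and the assumption that a gene is kept when its score lies within an inclusive low/high range are all hypothetical and are used only for illustration.

```python
from typing import Dict, Optional


def apply_threshold(scores: Dict[str, float],
                    low: Optional[float] = None,
                    high: Optional[float] = None) -> Dict[str, float]:
    """Keep genes whose score lies within the inclusive range [low, high].

    Hypothetical helper: a bound set to None is simply not applied.
    """
    kept = {}
    for gene, score in scores.items():
        if low is not None and score < low:
            continue
        if high is not None and score > high:
            continue
        kept[gene] = score
    return kept


# Hypothetical uploaded values, for example p-values from an experiment.
uploaded = {"Drd2": 0.03, "Bdnf": 0.20, "Comt": 0.008}

# A threshold that is too strict leaves no genes, so the set looks empty.
too_strict = apply_threshold(uploaded, high=0.001)

# A more suitable threshold keeps the genes of interest.
relaxed = apply_threshold(uploaded, high=0.05)
```

In this sketch, `too_strict` contains no genes even though three were uploaded without error, which mirrors the "empty gene set" symptom described above.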

Q: A public gene set is improperly labeled. How do I report this?

A: From the GeneSet's information page, click the "Report Problem with this Page" link in the top right corner and let us know what specifically needs updating.

Q: How are homologous genes identified?

A: We use HomoloGene along with any information provided by the reference genome resources. For example, RGD also provides MGI IDs.
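
As a rough illustration of how homology can connect gene sets from different species, the sketch below maps each gene to a shared homology cluster and compares the sets at the cluster level. The mapping table, the cluster numbers, and the function name `to_clusters` are made up for this example; they are not real HomoloGene identifiers and do not describe GeneWeaver's internal data model.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical lookup from (species, gene symbol) to a homology cluster ID.
# The cluster numbers are invented for illustration only.
HOMOLOGY_CLUSTER: Dict[Tuple[str, str], int] = {
    ("Mus musculus", "Drd2"): 1,
    ("Homo sapiens", "DRD2"): 1,
    ("Rattus norvegicus", "Drd2"): 1,
    ("Mus musculus", "Bdnf"): 2,
    ("Homo sapiens", "BDNF"): 2,
    ("Homo sapiens", "COMT"): 3,
}


def to_clusters(species: str, genes: Iterable[str]) -> Set[int]:
    """Translate gene symbols into homology cluster IDs.

    Genes without a known cluster are skipped in this sketch.
    """
    return {
        HOMOLOGY_CLUSTER[(species, gene)]
        for gene in genes
        if (species, gene) in HOMOLOGY_CLUSTER
    }


mouse_clusters = to_clusters("Mus musculus", ["Drd2", "Bdnf", "Xyz1"])
human_clusters = to_clusters("Homo sapiens", ["DRD2", "COMT"])

# Clusters present in both sets represent homologous genes shared across species.
shared = mouse_clusters & human_clusters
```

Comparing cluster IDs rather than species-specific symbols is what lets a mouse gene set and a human gene set overlap in this example.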

Q. My gene sets are listed as 'deprecated'. What does this mean?

A. If a newer version of a Gene Set in one of your projects is available, the version you stored is marked "deprecated." Clicking on the provided icon will update your project with the latest version of this data. New versions are available when we update data from external sources, e.g. MP and GO annotations, or when the GeneSet Metadata has been updated.

Q: How should I cite GeneWeaver in my research?

A: Please cite: Erich J. Baker, Jeremy J. Jay, Jason A. Bubier, Michael A. Langston, and Elissa J. Chesler. GeneWeaver: a web-based system for integrative functional genomics. Nucl. Acids Res. (2012) 40(D1): D1067-D1076

Q: What do all the acronyms on the site stand for?

A: DRG (Drug-Related Genes), CTD (Comparative Toxicogenomics Database), MP (Mammalian Phenotype Ontology), HP (Human Phenotype Ontology), ABA (Allen Brain Atlas), GO (Gene Ontology), MeSH (Medical Subject Headings).

Searching GeneWeaver

Our database includes data obtained from numerous external data resources. GeneWeaver allows users to conduct text searches on metadata and raw data stored in our database. These include searches by permission level, species, curation tier, gene set information or genes of interest. Please see our Search Help page for more details on text search options. Occasionally it is useful to search for gene sets anchored on genes or gene sets of interest based on their overlap with neighboring gene sets. Anchored Biclique of Biomolecular Associations (ABBA) is a tool that allows you to accomplish this task.

Gene Set Utilities

GeneSet Details Pages allow users to view vital information about gene sets of interest, including associated genes, homologs, and references to external links. Gene Intersection Lists are useful for determining which information is shared between gene sets of interest. In addition, GeneWeaver tools allow users to Combine gene sets of interest or perform more complex set operations based on Boolean Algebra. Gene sets may also be annotated with information about Emphasis Genes, allowing users to augment GeneWeaver tools with gene-specific information.
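
The kinds of set operations described above can be sketched with ordinary Python sets. The gene set names and gene symbols below are hypothetical, and the code illustrates Boolean set algebra in general rather than the implementation of any GeneWeaver tool.

```python
from collections import Counter

# Hypothetical gene sets, keyed by made-up identifiers.
gene_sets = {
    "GS_A": {"Drd2", "Bdnf", "Comt"},
    "GS_B": {"Drd2", "Comt", "Oprm1"},
    "GS_C": {"Drd2", "Htr2a"},
}

# Intersection: genes found in every set.
in_all = set.intersection(*gene_sets.values())

# Union: genes found in any set (a simple "combine").
in_any = set.union(*gene_sets.values())

# Difference: genes found only in GS_A.
only_a = gene_sets["GS_A"] - gene_sets["GS_B"] - gene_sets["GS_C"]

# Count how many sets each gene appears in, then keep genes seen at least twice.
membership = Counter(gene for genes in gene_sets.values() for gene in genes)
in_at_least_two = {gene for gene, count in membership.items() if count >= 2}
```

Here `in_all` plays the role of an intersection list, `in_any` corresponds to combining gene sets, and `in_at_least_two` shows a slightly more complex Boolean condition built from set membership counts.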

Software

Data

Available Data

All of the publications referenced by GeneWeaver GeneSets have been collected into an EndNote-formatted library that can be found here.

Data Export

GeneWeaver also allows users to download gene sets that they have permission to access. Available formats are:

To use the export function, visit an available GeneSet page and select Export Data from the right-hand column (see the Figure below). This will bring up a modal dialog where you can select the appropriate format. Depending on your browser settings, the download should start automatically.

Installation

The GeneWeaver interface is open source and freely available from our git repository hosted by Bitbucket. For security reasons, however, the repository is password protected. Please contact us for appropriate permissions.

Other Important Links

Publications

How to cite GeneWeaver

Erich J. Baker, Jeremy J. Jay, Jason A. Bubier, Michael A. Langston, and Elissa J. Chesler. GeneWeaver: a web-based system for integrative functional genomics. Nucleic Acids Research (2012) 40(D1): D1067-D1076

Publications Describing GeneWeaver

Other Relevant GeneWeaver Citations

GeneNetwork

QR Code

QR Code to Main Page

Policies

Usage Policy and Disclaimer

Data and web site providers make no guarantees or warranties as to the accuracy or completeness of results obtained from accessing and using information from GeneWeaver. We will not be liable to any user or anyone else for any inaccuracy, error or omission, regardless of cause, in the data contained in the GeneWeaver databases or any resulting damages. In addition, the data providers do not warrant that the databases will meet your requirements, be uninterrupted, or error-free. Data providers expressly exclude and disclaim all expressed and implied warranties of merchantability and fitness for a particular purpose. Data providers shall not be responsible for any damage or loss of any kind arising out of or related to your use of the databases, including without limitation data loss or corruption, regardless of whether such liability is based in tort, contract, or otherwise.

To report any errors found in the GeneWeaver database, please notify the appropriate person listed on our Contacts page.

Data Sharing Policy

Data sharing in GeneWeaver is as broad or restrictive as the investigator allows. When data are uploaded, they can be made private, public, or accessible only to selected groups. Access restrictions can be changed at any time. All group members are also visible on the groups page. The only people with access to your data are those whom you personally allow, or those whom your group administrator(s) allow. GeneWeaver will make no use of the data beyond normal metrics used to optimize algorithm or database efficiency, or other internal use solely for the development of GeneWeaver. See the Privacy Policy for more information.

In addition, our directives to share data stem from the NIH Data Sharing Policy, which states:

Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data.

Privacy Policy

Contacts

For questions about...

GeneWeaver vision and future direction:

Software questions and reporting bugs:

Database design:

Graph algorithms:

Data curation:

These pages are maintained by the GeneWeaver team and the Chesler Lab at The Jackson Laboratory in Bar Harbor, Maine.