The PhyloPandas DataFrame

The phylopandas dataframe is the core datastructure in this package. It defines a set of columns (or grammar) for phylogenetic data. A few advantages of defining such a grammar is: 1) we can leverage powerful+interactive visualization tools like Vega and 2) we standardize phylogenetic data in a familiar format.

Columns of a Phylopandas DataFrame

When reading sequence data, the following information will be stored on the dataframe.

  1. sequence : DNA or protein sequence.
  2. id: user defined label or identifier.
  3. description: user defined description.

When reading tree data, the following information will be stored on the dataframe.

  1. type : label describing the type of node; either “leaf” or “node”.
  2. parent : label of parent node.
  3. branch_length : distance from parent node.

PhyloPandas indexes each sequence using a randomly generated 10 character key.

If reading tree data from a PhyloPandas DataFrame containing sequence data, the two dataframes will be merged on the randomly generated index (unless otherwise specified).

If reading sequence data from a PhyloPandas DataFrae containing tree data, the two dataframes will be merged on the randomly generated index (unless otherwise specified).