Binary Upper Triangular Matrix (BUTLR) File Format

The Binary Upper TrianguLar MatRix, or BUTLR (.butlr, .btr), file format encodes contact matrices derived from chromosome conformation capture (3C)-based technologies (e.g. Hi-C). The BUTLR format adds binary indexed encoding, which allows random access to contact matrices and improves performance in both computing time and memory. In the 3D Genome Browser, such random access is especially important, as it enables remote access to contact matrix files hosted on other servers: the browser queries the region of interest directly instead of requiring entire files to be uploaded onto our server. Additionally, the BUTLR format reduces the storage space of contact matrix files, not only through binary encoding but also through the omission of redundant values. Perhaps the most striking example in this regard is that a 1-kb hg19 intrachromosomal contact matrix requires ~1 TB of storage in the tab-delimited matrix format and ~32 GB in the coordinated list format. In comparison, the BUTLR format requires only ~11 GB of storage, in addition to providing the benefit of random access.

To browse and visualize user-generated contact matrix files or any other datasets not hosted on our server, convert them to the BUTLR file format with BUTLRTools, a set of Perl scripts available on GitHub. Be sure to download the entire folder rather than individual scripts, as the scripts rely on modules located in the same folder.

Table of Contents
I. Contact Matrix Formats: Tab-Delimited vs. Coordinated List
II. Output of Different Hi-C Pipelines (homerToMatrix.pl)
III. Convert to BUTLR (matrixToButlr.pl)
IV. Convert from BUTLR (butlrToMatrix.pl)
V. Upload to the 3D Genome Browser
VI. Troubleshooting Common Errors

I. Contact Matrix Formats: Tab-Delimited vs. Coordinated List

The tab-delimited text file encodes the entire contact matrix, with the elements of each row separated by tabs ("\t") and the rows separated by newlines ("\n"). Hi-C datasets generated by the Bing Ren group, some of which are hosted under the 'DOWNLOAD' tab, are in the tab-delimited format. Below is a simplified example of a 5 × 5 contact matrix in the tab-delimited format.
24	7	3	0	0
7	16	10	5	0
3	10	19	12	4
0	5	12	21	9
0	0	4	9	17

This contact matrix stores intrachromosomal interactions. Notice that the values of the matrix are mirrored across its diagonal, which makes sense biologically, as the number of contacts between locus i and locus j equals the number between locus j and locus i. The BUTLR file format removes this redundancy to reduce storage. The current matrix is 91 bytes.
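The redundancy can be made concrete with a minimal Python sketch. It checks that the example matrix is symmetric, keeps only the values on and above the diagonal, and rebuilds the full matrix from them. The function names are illustrative only, and the sketch does not reproduce the actual BUTLR byte layout.

```python
# The 5 x 5 example matrix from above, as a list of rows.
matrix = [
    [24, 7, 3, 0, 0],
    [7, 16, 10, 5, 0],
    [3, 10, 19, 12, 4],
    [0, 5, 12, 21, 9],
    [0, 0, 4, 9, 17],
]


def is_symmetric(m):
    """Return True if m is square and m[i][j] == m[j][i] for all i, j."""
    n = len(m)
    return all(len(row) == n for row in m) and all(
        m[i][j] == m[j][i] for i in range(n) for j in range(i + 1, n)
    )


def upper_triangle(m):
    """Return the values on and above the diagonal, row by row."""
    return [m[i][j] for i in range(len(m)) for j in range(i, len(m))]


def restore_full(values, n):
    """Rebuild the full symmetric n x n matrix from its upper triangle."""
    full = [[0] * n for _ in range(n)]
    it = iter(values)
    for i in range(n):
        for j in range(i, n):
            full[i][j] = full[j][i] = next(it)
    return full


n = len(matrix)
kept = upper_triangle(matrix)
print(is_symmetric(matrix))             # True
print(len(kept), "of", n * n)           # 15 of 25
print(restore_full(kept, n) == matrix)  # True
```

For an n × n symmetric matrix, only n(n + 1)/2 of the n² values need to be kept, which approaches half of the matrix as n grows.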

The coordinated list (also called a coordinate list) is a way of representing a sparse matrix, that is, a matrix in which most of the elements are zero. The contact matrices of mammalian genomes are sparse, as most interactions are highly localized. Since most of the elements in a sparse matrix are zero, these elements are not stored. Instead, the coordinated list stores a list of tuples (row, column, value) describing all nonzero elements in the sparse matrix. In addition, since the entire matrix no longer needs to be stored, the diagonally mirrored values can also be omitted, as the contact information stored in (row i, column j) also gives the value of (row j, column i). The Hi-C datasets generated and shared by the Erez Lieberman Aiden group, in particular those from the Rao et al. 2014 paper, are in this format. Below is the coordinated list for the tab-delimited matrix above, assuming a resolution of 40kb; the row and column of each entry are given as the start positions of the bins in base pairs.

0	0	24
0	40000	7
0	80000	3
40000	40000	16
40000	80000	10
40000	120000	5
80000	80000	19
80000	120000	12
80000	160000	4
120000	120000	21
120000	160000	9
160000	160000	17

The size of this example is 207 bytes, compared to 91 bytes for the tab-delimited text example. Astute readers may notice that this example is not a sparse matrix but a dense one. For a dense matrix, a coordinate list may not reduce storage, as the cost of storing row and column positions may outweigh the savings from omitting zeros. As noted, however, mammalian genome contact matrices are sparse and therefore benefit from the coordinate list format.
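The conversion from the tab-delimited matrix to the list can be sketched in a few lines of Python. This is only an illustration of the idea: the function name to_coordinate_list is hypothetical and is not part of BUTLRTools.

```python
def to_coordinate_list(matrix, resolution):
    """Return (row_start, col_start, value) tuples for every nonzero
    entry on or above the diagonal of a square contact matrix."""
    entries = []
    for i, row in enumerate(matrix):
        # Start at column i so mirrored values below the diagonal are skipped.
        for j in range(i, len(row)):
            if row[j] != 0:
                entries.append((i * resolution, j * resolution, row[j]))
    return entries


matrix = [
    [24, 7, 3, 0, 0],
    [7, 16, 10, 5, 0],
    [3, 10, 19, 12, 4],
    [0, 5, 12, 21, 9],
    [0, 0, 4, 9, 17],
]

coordinate_list = to_coordinate_list(matrix, 40000)
for entry in coordinate_list:
    print("\t".join(str(field) for field in entry))
```

Run on the 5 × 5 example at 40kb resolution, the sketch prints the same twelve entries as the list above.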

II. Output of Different Hi-C Pipelines (homerToMatrix.pl)

As the previous section suggests, different laboratory groups use different pipelines that yield contact matrices in different formats. The Ren group pipeline outputs contact matrices in the tab-delimited text format, in separate files divided by chromosome (chr1.matrix for chromosome 1, chr2.matrix for chromosome 2, chr3.matrix for chromosome 3, etc). The datasets from the Dixon et al. 2012 paper, hosted under the 'DOWNLOAD' tab, are in this format. As also described above, the Lieberman Aiden group provided the Rao et al. 2014 datasets in the coordinated list format, also separated by chromosome. Meanwhile, another established Hi-C pipeline, HOMER, yields a third format: one tab-delimited file that contains all chromosomes and all their intra- and interchromosomal interactions. The figure below illustrates this format.
[Figure: HOMER matrix output format. Image from HOMER.]

In an effort to standardize the formats, the script homerToMatrix.pl can be used to split the single comprehensive HOMER output by chromosome into a format similar to that of the Ren group, in preparation for conversion to the BUTLR format. A HOMER matrix can be converted to the chromosome-separated, tab-delimited text format with
perl homerToMatrix.pl -m <homer matrix> -g <genome size file> [-o <output file prefix>]

This command yields files named [output file prefix].[chrom1].[chrom2].matrix, where [chrom1] provides the bins along the rows and [chrom2] provides the bins along the columns. Since the interchromosomal matrices [output file prefix].[chrom1].[chrom2].matrix and [output file prefix].[chrom2].[chrom1].matrix are transposes of each other and contain redundant interaction entries, only one of the two files is created: the one in which [chrom1] is larger than [chrom2], so that the matrix has more rows than columns. Additionally, the command outputs a matrix list file, [output file prefix].list, which serves as an input to matrixToButlr.pl, the script that converts to BUTLR.

III. Convert to BUTLR (matrixToButlr.pl)

The Perl script matrixToButlr.pl converts contact matrices in the tab-delimited text, chromosome-separated format (Ren group output) into the BUTLR format for visualization in the 3D Genome Browser. The general command is as follows:
perl matrixToButlr.pl -g <genome size file> -m <matrix list file> -a <genome assembly> -r <resolution>
[-h <row number where matrix begins (1-based)>][-o <output filename>]

Genome Size File

Two-column, tab-delimited file with column 1 containing chromosome names and column 2, the size of the corresponding chromosome. These files are often available as [assembly].chrom.sizes among the genome assembly datasets on the UCSC Genome Browser Downloads page (a web search for the file name is often faster). If the file is not available, one can be created by running the twoBitInfo program from the UCSC Utilities on the [assembly].2bit file (which itself can be converted from fasta files with faToTwoBit, also provided by UCSC), or created by the user. Another option is to query the UCSC MySQL table chromInfo in the database named after the genome assembly. Below is an example of a genome size file for hg19:
chr1	249250621
chr2	243199373
chr3	198022430
chr4	191154276
chr5	180915260
chr6	171115067
chr7	159138663
chrX	155270560
chr8	146364022
chr9	141213431
chr10	135534747
chr11	135006516
chr12	133851895
chr13	115169878
chr14	107349540
chr15	102531392
chr16	90354753
chr17	81195210
chr18	78077248
chr20	63025520
chrY	59373566
chr19	59128983
chr22	51304566
chr21	48129895

Matrix List File

Intrachromosomal Matrices Only: Two-column, tab-delimited file with column 1 containing chromosome names and column 2, filenames of each contact matrix (tab-delimited text format) for each corresponding chromosome. Example (matrix.list):
chr1	/directory-to-file/chr1.matrix
chr2	/directory-to-file/chr2.matrix
chr3	/directory-to-file/chr3.matrix
chr4	/directory-to-file/chr4.matrix
chr5	/directory-to-file/chr5.matrix
chr6	/directory-to-file/chr6.matrix
chr7	/directory-to-file/chr7.matrix
chrX	/directory-to-file/chrX.matrix
chr8	/directory-to-file/chr8.matrix
chr9	/directory-to-file/chr9.matrix
chr10	/directory-to-file/chr10.matrix
chr11	/directory-to-file/chr11.matrix
chr12	/directory-to-file/chr12.matrix
chr13	/directory-to-file/chr13.matrix
chr14	/directory-to-file/chr14.matrix
chr15	/directory-to-file/chr15.matrix
chr16	/directory-to-file/chr16.matrix
chr17	/directory-to-file/chr17.matrix
chr18	/directory-to-file/chr18.matrix
chr20	/directory-to-file/chr20.matrix
chr19	/directory-to-file/chr19.matrix
chr22	/directory-to-file/chr22.matrix
chr21	/directory-to-file/chr21.matrix
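When the per-chromosome files follow the [chrom].matrix naming used above, a list like this can be generated rather than typed by hand. The following is a minimal Python sketch; the helper name build_matrix_list and the directory are illustrative assumptions, and the sketch does not check that the files exist.

```python
import posixpath


def build_matrix_list(chroms, directory, suffix=".matrix"):
    """Return one 'chrom<TAB>path' line per chromosome, in the order given."""
    return [
        f"{chrom}\t{posixpath.join(directory, chrom + suffix)}"
        for chrom in chroms
    ]


# Hypothetical subset of chromosomes and a placeholder directory.
lines = build_matrix_list(["chr1", "chr2", "chrX"], "/directory-to-file")
print("\n".join(lines))
```

Writing the returned lines to a file, one per line, yields a matrix list in the two-column format described above.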

Inclusion of Interchromosomal Matrices: Three-column, tab-delimited file with column 1 containing the name of chrom1, column 2, the name of chrom2, and column 3, the filename of the contact matrix (tab-delimited text format) for the interaction between chrom1 and chrom2. Since the interchromosomal matrices [file prefix].[chrom1].[chrom2].matrix and [file prefix].[chrom2].[chrom1].matrix are transposes of each other and contain redundant interaction entries, only one of the two files is needed. Currently, the script only supports interchromosomal matrix inputs in which the number of rows is greater than the number of columns, so supplying the transpose matrix will cause an error. While the current 3D Genome Browser does not support the visualization of interchromosomal interactions, there are plans to develop this feature in the future. The current BUTLR format already supports the encoding of interchromosomal interaction matrices, with random access as well as binary and redundancy compression.
chr1	chr1	/directory-to-file/chr1.chr1.matrix
chr1	chr2	/directory-to-file/chr1.chr2.matrix
chr1	chr3	/directory-to-file/chr1.chr3.matrix
chr1	chr4	/directory-to-file/chr1.chr4.matrix
chr1	chr5	/directory-to-file/chr1.chr5.matrix
chr1	chr6	/directory-to-file/chr1.chr6.matrix
chr1	chr7	/directory-to-file/chr1.chr7.matrix
chr1	chrX	/directory-to-file/chr1.chrX.matrix
chr1	chr8	/directory-to-file/chr1.chr8.matrix
chr1	chr9	/directory-to-file/chr1.chr9.matrix
chr1	chr10	/directory-to-file/chr1.chr10.matrix
chr1	chr11	/directory-to-file/chr1.chr11.matrix
chr1	chr12	/directory-to-file/chr1.chr12.matrix
chr1	chr13	/directory-to-file/chr1.chr13.matrix
chr1	chr14	/directory-to-file/chr1.chr14.matrix
chr1	chr15	/directory-to-file/chr1.chr15.matrix
chr1	chr16	/directory-to-file/chr1.chr16.matrix
chr1	chr17	/directory-to-file/chr1.chr17.matrix
chr1	chr18	/directory-to-file/chr1.chr18.matrix
chr1	chr20	/directory-to-file/chr1.chr20.matrix
chr1	chr19	/directory-to-file/chr1.chr19.matrix
chr1	chr22	/directory-to-file/chr1.chr22.matrix
chr1	chr21	/directory-to-file/chr1.chr21.matrix
chr2	chr2	/directory-to-file/chr2.chr2.matrix
chr2	chr3	/directory-to-file/chr2.chr3.matrix
chr2	chr4	/directory-to-file/chr2.chr4.matrix
chr2	chr5	/directory-to-file/chr2.chr5.matrix
chr2	chr6	/directory-to-file/chr2.chr6.matrix
chr2	chr7	/directory-to-file/chr2.chr7.matrix
chr2	chrX	/directory-to-file/chr2.chrX.matrix
chr2	chr8	/directory-to-file/chr2.chr8.matrix
chr2	chr9	/directory-to-file/chr2.chr9.matrix
chr2	chr10	/directory-to-file/chr2.chr10.matrix
chr2	chr11	/directory-to-file/chr2.chr11.matrix
chr2	chr12	/directory-to-file/chr2.chr12.matrix
chr2	chr13	/directory-to-file/chr2.chr13.matrix
chr2	chr14	/directory-to-file/chr2.chr14.matrix
chr2	chr15	/directory-to-file/chr2.chr15.matrix
chr2	chr16	/directory-to-file/chr2.chr16.matrix
chr2	chr17	/directory-to-file/chr2.chr17.matrix
chr2	chr18	/directory-to-file/chr2.chr18.matrix
chr2	chr20	/directory-to-file/chr2.chr20.matrix
chr2	chr19	/directory-to-file/chr2.chr19.matrix
chr2	chr22	/directory-to-file/chr2.chr22.matrix
chr2	chr21	/directory-to-file/chr2.chr21.matrix
chr3	chr3	/directory-to-file/chr3.chr3.matrix
chr3	chr4	/directory-to-file/chr3.chr4.matrix
chr3	chr5	/directory-to-file/chr3.chr5.matrix
chr3	chr6	/directory-to-file/chr3.chr6.matrix
chr3	chr7	/directory-to-file/chr3.chr7.matrix
chr3	chrX	/directory-to-file/chr3.chrX.matrix
chr3	chr8	/directory-to-file/chr3.chr8.matrix
chr3	chr9	/directory-to-file/chr3.chr9.matrix
chr3	chr10	/directory-to-file/chr3.chr10.matrix
chr3	chr11	/directory-to-file/chr3.chr11.matrix
chr3	chr12	/directory-to-file/chr3.chr12.matrix
chr3	chr13	/directory-to-file/chr3.chr13.matrix
chr3	chr14	/directory-to-file/chr3.chr14.matrix
chr3	chr15	/directory-to-file/chr3.chr15.matrix
chr3	chr16	/directory-to-file/chr3.chr16.matrix
chr3	chr17	/directory-to-file/chr3.chr17.matrix
chr3	chr18	/directory-to-file/chr3.chr18.matrix
chr3	chr20	/directory-to-file/chr3.chr20.matrix
chr3	chr19	/directory-to-file/chr3.chr19.matrix
chr3	chr22	/directory-to-file/chr3.chr22.matrix
chr3	chr21	/directory-to-file/chr3.chr21.matrix
.
.
.
chr22	chr22	/directory-to-file/chr22.chr22.matrix
chr22	chr21	/directory-to-file/chr22.chr21.matrix
chr21	chr21	/directory-to-file/chr21.chr21.matrix
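The ordering of the pairs above can be sketched in Python: chromosomes are sorted from largest to smallest, and each chromosome is paired with itself and with every smaller chromosome, so that chrom1 (the rows) is never the smaller of the two. The helper name build_pair_list and the directory are illustrative assumptions, not part of BUTLRTools, and the example uses only three chromosomes from the hg19 size file above.

```python
from itertools import combinations_with_replacement
import posixpath


def build_pair_list(sizes, directory):
    """Return 'chrom1<TAB>chrom2<TAB>path' lines with chrom1 >= chrom2 in size.

    sizes maps chromosome name to chromosome length in base pairs.
    """
    # Largest chromosome first, so the first member of each pair is the larger.
    ordered = sorted(sizes, key=sizes.get, reverse=True)
    lines = []
    for chrom1, chrom2 in combinations_with_replacement(ordered, 2):
        path = posixpath.join(directory, f"{chrom1}.{chrom2}.matrix")
        lines.append(f"{chrom1}\t{chrom2}\t{path}")
    return lines


sizes = {"chr21": 48129895, "chr1": 249250621, "chr22": 51304566}
pair_lines = build_pair_list(sizes, "/directory-to-file")
print("\n".join(pair_lines))
```

Two chromosomes of nearly equal size can have the same number of bins at a coarse resolution, in which case the matrix is square rather than strictly taller than wide; the sketch does not treat that case specially.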

Genome Assembly

Genome assembly of the Hi-C dataset/contact matrices. This field is needed to display the correct UCSC Genome Browser session. If the assembly is user-generated, please refer to the guide for creating assembly hubs on UCSC. After the hub is created, simply copy and paste its URL into the UCSC Genome Browser Session ID textbox.

Resolution of Contact Matrix

The resolution of the contact matrix, which can be provided in base pairs (40000), kilobases (40kb) or megabases (1Mb).
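As an illustration of how the three notations relate, the following Python sketch converts each of them to base pairs. It is an assumption about reasonable behavior, not a description of how matrixToButlr.pl parses the -r value, and the function name parse_resolution is hypothetical.

```python
import re

# Multipliers for the unit suffixes shown above (matched case-insensitively).
_UNITS = {"": 1, "kb": 1_000, "mb": 1_000_000}


def parse_resolution(text):
    """Convert a resolution such as '40000', '40kb' or '1Mb' to base pairs."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse resolution: {text!r}")
    number, unit = match.groups()
    try:
        factor = _UNITS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit in resolution: {text!r}") from None
    return int(number) * factor


print(parse_resolution("40000"))  # 40000
print(parse_resolution("40kb"))   # 40000
print(parse_resolution("1Mb"))    # 1000000
```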

Option: -h <row number where matrix begins (1-based)>

Specify the row number (1-based) at which the matrix begins, that is, the number of header lines plus one. For example, with one header line, the matrix starts at row number 2 (-h 2). Default: 1 (no header).
There is currently no option for specifying the column at which the matrix begins (that is, for declaring the presence of row names). The script assumes that any columns at the beginning of a row beyond the number of bins calculated from the chromosome size and matrix resolution are row names and disregards them accordingly.
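The behavior described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of matrixToButlr.pl: the function name read_matrix_rows is hypothetical, and dropping empty trailing fields is an assumption about how extra tabs at the end of a line might be handled.

```python
def read_matrix_rows(lines, first_row, n_bins):
    """Yield the matrix values of each row as a list of strings.

    lines     -- iterable of text lines (for example, an open file)
    first_row -- 1-based row number where the matrix begins (the -h value)
    n_bins    -- expected number of bins, i.e. columns, per row
    """
    for line_number, line in enumerate(lines, start=1):
        if line_number < first_row:
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        # Assumption: empty trailing fields come from extra tabs and are dropped.
        while fields and fields[-1] == "":
            fields.pop()
        if len(fields) < n_bins:
            raise ValueError(f"row {line_number}: fewer than {n_bins} columns")
        # Any extra leading columns are treated as row names and discarded.
        yield fields[len(fields) - n_bins:]


example = ["\tbin1\tbin2\n", "bin1\t5\t2\n", "bin2\t2\t7\n"]
rows = list(read_matrix_rows(example, first_row=2, n_bins=2))
print(rows)  # [['5', '2'], ['2', '7']]
```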

Example

Below is an example of creating BUTLR files from hg19 GM12878 at 40kb:
perl matrixToButlr.pl -g hg19.chrom.sizes -m matrix.list -a hg19 -r 40kb -o GM12878.40kb.btr

Sanity Checks

To minimize errors, ensure that the matrix is well-formed. Manually calculate the expected number of bins with ceil( chrom_size / resolution ).
Make sure that this number matches the number of rows in the contact matrix:
wc -l <chrom.matrix>
It should also match the number of columns in the contact matrix. The command below should print a single number; any additional output signifies irregularities.
awk -F "\t" -v OFS="\t" '{print NF}' <chrom.matrix> | uniq
Note: the datasets from Dixon et al. 2012 hosted here have an extra tab at the end, so they do not pass this sanity check. Fortunately, BUTLRTools handles extra whitespace gracefully.
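The same checks can be combined in a short Python sketch. The function names expected_bins and check_matrix are hypothetical and not part of BUTLRTools; the sketch assumes a matrix without header lines or row names, and it ignores trailing tabs so that files like the Dixon et al. 2012 datasets are not flagged.

```python
def expected_bins(chrom_size, resolution):
    """Return ceil(chrom_size / resolution) using integer arithmetic."""
    return -(-chrom_size // resolution)


def check_matrix(lines, chrom_size, resolution):
    """Return a list of problems; an empty list means the checks passed."""
    bins = expected_bins(chrom_size, resolution)
    problems = []
    n_rows = 0
    for n_rows, line in enumerate(lines, start=1):
        # Trailing tabs are stripped before counting columns.
        n_cols = len(line.rstrip("\n").rstrip("\t").split("\t"))
        if n_cols != bins:
            problems.append(f"row {n_rows}: {n_cols} columns, expected {bins}")
    if n_rows != bins:
        problems.append(f"{n_rows} rows, expected {bins}")
    return problems


print(expected_bins(249250621, 40000))  # 6232 bins for hg19 chr1 at 40kb

# The 5 x 5 example matrix, with a made-up chromosome size of 200000 bp.
example = [
    "24\t7\t3\t0\t0\n",
    "7\t16\t10\t5\t0\n",
    "3\t10\t19\t12\t4\n",
    "0\t5\t12\t21\t9\n",
    "0\t0\t4\t9\t17\n",
]
print(check_matrix(example, 200000, 40000))  # []
```

To check a real file, an open file handle can be passed as lines, for example check_matrix(open("chr1.matrix"), 249250621, 40000).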

IV. Convert from BUTLR (butlrToMatrix.pl)

TBA

V. Upload to the 3D Genome Browser

TBA

VI. Troubleshooting Common Errors

TBA