
Data preprocessing for analyzing the assemblies

This notebook contains all scripts used for data preprocessing and assembly analysis. For figure generation, please refer to the figure generation page.

MUMmer for synteny coordinates

In [1]:
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/MUMer
In [2]:
# Step 1: Alignment of both genomes using nucmer
nucmer --maxmatch -p synteny ../genomes/DSM158.fasta ../genomes/GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta

# Step 2: Filter the matches, keeping only the 1-to-1 alignments (-1)
delta-filter -1 synteny.delta > synteny.filtered.delta

# Step 3: Create the tab-delimited coords file (for plotting)
show-coords -rcl -T synteny.filtered.delta > synteny.coords
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "synteny.ntref" of length 4520363
# construct suffix tree for sequence of length 4520363
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 45203 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /usr/bin/mummer synteny.ntref 1.30
# reading input file "/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/MUMer/../genomes/GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta" of length 4520329
# matching query-file "/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/MUMer/../genomes/GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta"
# against subject-file "synteny.ntref"
# COMPLETETIME /usr/bin/mummer synteny.ntref 4.19
# SPACE /usr/bin/mummer synteny.ntref 8.75
4: FINISHING DATA
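
As a quick sanity check before plotting, the coords file can be summarized with awk. The following is a minimal sketch, not part of the original workflow. It assumes the tab-delimited layout that `show-coords -rcl -T` is expected to produce, with the reference alignment length ([LEN 1]) in column 5 and the percent identity ([% IDY]) in column 7. The function name `summarize_coords` is hypothetical.

```shell
# Hypothetical helper: summarize a tab-delimited show-coords file.
# Assumption: column 5 = [LEN 1], column 7 = [% IDY].
summarize_coords() {
    awk -F'\t' '
        # Header lines are skipped: only rows starting with a coordinate are used
        $1 ~ /^[0-9]+$/ {
            n++
            aligned += $5          # aligned bases on the reference
            weighted += $5 * $7    # identity weighted by alignment length
        }
        END {
            if (n == 0) { print "no alignment rows found"; exit 1 }
            printf "alignments\t%d\n", n
            printf "aligned_ref_bp\t%d\n", aligned
            printf "mean_identity\t%.2f\n", weighted / aligned
        }
    ' "$1"
}

# Usage (file name taken from the cell above):
# summarize_coords synteny.coords
```

Overlapping alignments would be counted more than once in `aligned_ref_bp`, so the value is only an upper estimate of the covered reference length.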

Mash distance calculation

In [3]:
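# Mash is expected in the conda environment "mash" (nothing is installed here)
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes
conda activate mash
mkdir -p ../mash # make sure the output directory exists
mash sketch -o ./all *.fasta # sketch all genomes (reduced MinHash representation)
mash dist all.msh all.msh > ../mash/dist.tab # all-vs-all pairwise distances
conda deactivate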
Sketching DSM158.fasta...
Sketching GCF_000012905.2_ASM1290v2_genomic.fasta...
Sketching GCF_000015985.1_ASM1598v1_genomic.fasta...
Sketching GCF_000021005.1_ASM2100v1_genomic.fasta...
Sketching GCF_000212605.1_ASM21260v1_genomic.fasta...
Sketching GCF_000269625.1_PB_Rhod_Spha_2_4_1_V1_genomic.fasta...
Sketching GCF_000273405.1_Rhod_Spha_2_4_1_V1_genomic.fasta...
Sketching GCF_001576595.1_ASM157659v1_genomic.fasta...
Sketching GCF_001685625.1_ASM168562v1_genomic.fasta...
Sketching GCF_002706325.1_ASM270632v1_genomic.fasta...
Sketching GCF_003324715.1_ASM332471v1_genomic.fasta...
Sketching GCF_003846365.1_ASM384636v1_genomic.fasta...
Sketching GCF_003846385.1_ASM384638v1_genomic.fasta...
Sketching GCF_003846405.1_ASM384640v1_genomic.fasta...
Sketching GCF_003846425.1_ASM384642v1_genomic.fasta...
Sketching GCF_012647365.1_ASM1264736v1_genomic.fasta...
Sketching GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta...
Sketching GCF_052246835.1_ASM5224683v1_genomic.fasta...
Writing to ./all.msh...
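
`mash dist` writes one line per genome pair (reference, query, distance, p-value, shared hashes). For heatmaps or clustering, a square matrix is often more convenient. The sketch below shows one possible conversion with awk. The function name `mash_to_matrix` is hypothetical and the step is not part of the original notebook.

```shell
# Hypothetical helper: pivot the pairwise mash dist table into a square matrix.
# Columns used: 1 = reference ID, 2 = query ID, 3 = Mash distance.
mash_to_matrix() {
    awk -F'\t' '
        # Remember each reference ID once, in order of first appearance
        !($1 in seen) { seen[$1] = 1; names[++n] = $1 }
        { d[$1, $2] = $3 }
        END {
            header = "genome"
            for (j = 1; j <= n; j++) header = header "\t" names[j]
            print header
            for (i = 1; i <= n; i++) {
                row = names[i]
                for (j = 1; j <= n; j++)
                    row = row "\t" (((names[i], names[j]) in d) ? d[names[i], names[j]] : "NA")
                print row
            }
        }
    ' "$1"
}

# Usage (path taken from the cell above):
# mash_to_matrix ../mash/dist.tab > ../mash/dist.matrix.tsv
```

Pairs that are missing from the input are written as `NA`. With the all-vs-all call above, every pair should be present.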

Quality assessment of the genomes

BUSCO (completeness)

In [4]:
conda activate busco
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes
busco -i ./DSM158.fasta -m genome -l rhodobacter_odb12 -c 20 -o DSM # strain DSM158
busco -i ./GCF_000012905.2_ASM1290v2_genomic.fasta -m genome -l rhodobacter_odb12 -c 20 -o NCBI_Ref # NCBI reference
busco -i ./GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta -m genome -l rhodobacter_odb12 -c 20 -o SUBH2 # substrain H2
conda deactivate
2025-10-30 10:16:28 INFO:	***** Start a BUSCO v6.0.0 analysis, current time: 10/30/2025 10:16:28 *****
2025-10-30 10:16:28 INFO:	Configuring BUSCO with local environment
2025-10-30 10:16:28 INFO:	Running genome mode
2025-10-30 10:16:28 INFO:	Downloading information on latest versions of BUSCO data...
2025-10-30 10:16:30 INFO:	Input file is /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/DSM158.fasta
2025-10-30 10:16:30 INFO:	The local file or folder /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/busco_downloads/lineages/rhodobacter_odb12 is the last available version.
2025-10-30 10:16:31 INFO:	Running BUSCO using lineage dataset rhodobacter_odb12 (prokaryota, 2025-05-14)
2025-10-30 10:16:31 INFO:	Running 1 job(s) on bbtools, starting at 10/30/2025 10:16:31
2025-10-30 10:16:32 INFO:	[bbtools]	1 of 1 task(s) completed
2025-10-30 10:16:32 INFO:	***** Run Prodigal on input to predict and extract genes *****
2025-10-30 10:16:32 INFO:	Running Prodigal with genetic code 11 in single mode
2025-10-30 10:16:32 INFO:	Running 1 job(s) on prodigal, starting at 10/30/2025 10:16:32
2025-10-30 10:16:41 INFO:	[prodigal]	1 of 1 task(s) completed
2025-10-30 10:16:42 INFO:	Genetic code 11 selected as optimal
2025-10-30 10:16:42 INFO:	***** Run HMMER on gene sequences *****
2025-10-30 10:16:42 INFO:	Running 1343 job(s) on hmmsearch, starting at 10/30/2025 10:16:42
2025-10-30 10:16:44 INFO:	[hmmsearch]	135 of 1343 task(s) completed
2025-10-30 10:16:45 INFO:	[hmmsearch]	269 of 1343 task(s) completed
2025-10-30 10:16:45 INFO:	[hmmsearch]	403 of 1343 task(s) completed
2025-10-30 10:16:46 INFO:	[hmmsearch]	538 of 1343 task(s) completed
2025-10-30 10:16:47 INFO:	[hmmsearch]	672 of 1343 task(s) completed
2025-10-30 10:16:48 INFO:	[hmmsearch]	806 of 1343 task(s) completed
2025-10-30 10:16:48 INFO:	[hmmsearch]	941 of 1343 task(s) completed
2025-10-30 10:16:49 INFO:	[hmmsearch]	1075 of 1343 task(s) completed
2025-10-30 10:16:50 INFO:	[hmmsearch]	1209 of 1343 task(s) completed
2025-10-30 10:16:52 INFO:	[hmmsearch]	1343 of 1343 task(s) completed
2025-10-30 10:16:53 INFO:	Results:	C:98.2%[S:98.0%,D:0.2%],F:0.5%,M:1.3%,n:1343	   

2025-10-30 10:16:53 INFO:	

    ---------------------------------------------------
    |Results from dataset rhodobacter_odb12            |
    ---------------------------------------------------
    |C:98.2%[S:98.0%,D:0.2%],F:0.5%,M:1.3%,n:1343      |
    |1319    Complete BUSCOs (C)                       |
    |1316    Complete and single-copy BUSCOs (S)       |
    |3    Complete and duplicated BUSCOs (D)           |
    |7    Fragmented BUSCOs (F)                        |
    |17    Missing BUSCOs (M)                          |
    |1343    Total BUSCO groups searched               |
    ---------------------------------------------------
2025-10-30 10:16:53 INFO:	BUSCO analysis done. Total running time: 23 seconds
2025-10-30 10:16:53 INFO:	Results written in /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/DSM
2025-10-30 10:16:53 INFO:	For assistance with interpreting the results, please consult the userguide: https://busco.ezlab.org/busco_userguide.html

2025-10-30 10:16:53 INFO:	Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
2025-10-30 10:16:53 INFO:	Thank you for using BUSCO! Anonymous usage data is gathered to improve the tool. You may opt out with --opt-out-run-stats.
2025-10-30 10:16:54 INFO:	***** Start a BUSCO v6.0.0 analysis, current time: 10/30/2025 10:16:54 *****
2025-10-30 10:16:54 INFO:	Configuring BUSCO with local environment
2025-10-30 10:16:54 INFO:	Running genome mode
2025-10-30 10:16:54 INFO:	Downloading information on latest versions of BUSCO data...
2025-10-30 10:16:57 INFO:	Input file is /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/GCF_000012905.2_ASM1290v2_genomic.fasta
2025-10-30 10:16:57 INFO:	The local file or folder /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/busco_downloads/lineages/rhodobacter_odb12 is the last available version.
2025-10-30 10:16:57 INFO:	Running BUSCO using lineage dataset rhodobacter_odb12 (prokaryota, 2025-05-14)
2025-10-30 10:16:57 INFO:	Running 1 job(s) on bbtools, starting at 10/30/2025 10:16:57
2025-10-30 10:16:58 INFO:	[bbtools]	1 of 1 task(s) completed
2025-10-30 10:16:58 INFO:	***** Run Prodigal on input to predict and extract genes *****
2025-10-30 10:16:58 INFO:	Running Prodigal with genetic code 11 in single mode
2025-10-30 10:16:58 INFO:	Running 1 job(s) on prodigal, starting at 10/30/2025 10:16:58
2025-10-30 10:17:09 INFO:	[prodigal]	1 of 1 task(s) completed
2025-10-30 10:17:09 INFO:	Genetic code 11 selected as optimal
2025-10-30 10:17:09 INFO:	***** Run HMMER on gene sequences *****
2025-10-30 10:17:09 INFO:	Running 1343 job(s) on hmmsearch, starting at 10/30/2025 10:17:09
2025-10-30 10:17:11 INFO:	[hmmsearch]	135 of 1343 task(s) completed
2025-10-30 10:17:12 INFO:	[hmmsearch]	269 of 1343 task(s) completed
2025-10-30 10:17:12 INFO:	[hmmsearch]	403 of 1343 task(s) completed
2025-10-30 10:17:13 INFO:	[hmmsearch]	538 of 1343 task(s) completed
2025-10-30 10:17:14 INFO:	[hmmsearch]	672 of 1343 task(s) completed
2025-10-30 10:17:15 INFO:	[hmmsearch]	806 of 1343 task(s) completed
2025-10-30 10:17:15 INFO:	[hmmsearch]	941 of 1343 task(s) completed
2025-10-30 10:17:16 INFO:	[hmmsearch]	1075 of 1343 task(s) completed
2025-10-30 10:17:17 INFO:	[hmmsearch]	1209 of 1343 task(s) completed
2025-10-30 10:17:19 INFO:	[hmmsearch]	1343 of 1343 task(s) completed
2025-10-30 10:17:20 INFO:	Results:	C:98.5%[S:98.3%,D:0.2%],F:0.4%,M:1.1%,n:1343	   

2025-10-30 10:17:21 INFO:	

    ---------------------------------------------------
    |Results from dataset rhodobacter_odb12            |
    ---------------------------------------------------
    |C:98.5%[S:98.3%,D:0.2%],F:0.4%,M:1.1%,n:1343      |
    |1323    Complete BUSCOs (C)                       |
    |1320    Complete and single-copy BUSCOs (S)       |
    |3    Complete and duplicated BUSCOs (D)           |
    |5    Fragmented BUSCOs (F)                        |
    |15    Missing BUSCOs (M)                          |
    |1343    Total BUSCO groups searched               |
    ---------------------------------------------------
2025-10-30 10:17:21 INFO:	BUSCO analysis done. Total running time: 24 seconds
2025-10-30 10:17:21 INFO:	Results written in /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/NCBI_Ref
2025-10-30 10:17:21 INFO:	For assistance with interpreting the results, please consult the userguide: https://busco.ezlab.org/busco_userguide.html

2025-10-30 10:17:21 INFO:	Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
2025-10-30 10:17:21 INFO:	Thank you for using BUSCO! Anonymous usage data is gathered to improve the tool. You may opt out with --opt-out-run-stats.
2025-10-30 10:17:22 INFO:	***** Start a BUSCO v6.0.0 analysis, current time: 10/30/2025 10:17:22 *****
2025-10-30 10:17:22 INFO:	Configuring BUSCO with local environment
2025-10-30 10:17:22 INFO:	Running genome mode
2025-10-30 10:17:22 INFO:	Downloading information on latest versions of BUSCO data...
2025-10-30 10:17:24 INFO:	Input file is /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta
2025-10-30 10:17:24 INFO:	The local file or folder /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/busco_downloads/lineages/rhodobacter_odb12 is the last available version.
2025-10-30 10:17:24 INFO:	Running BUSCO using lineage dataset rhodobacter_odb12 (prokaryota, 2025-05-14)
2025-10-30 10:17:24 INFO:	Running 1 job(s) on bbtools, starting at 10/30/2025 10:17:24
2025-10-30 10:17:26 INFO:	[bbtools]	1 of 1 task(s) completed
2025-10-30 10:17:26 INFO:	***** Run Prodigal on input to predict and extract genes *****
2025-10-30 10:17:26 INFO:	Running Prodigal with genetic code 11 in single mode
2025-10-30 10:17:26 INFO:	Running 1 job(s) on prodigal, starting at 10/30/2025 10:17:26
2025-10-30 10:17:35 INFO:	[prodigal]	1 of 1 task(s) completed
2025-10-30 10:17:35 INFO:	Genetic code 11 selected as optimal
2025-10-30 10:17:35 INFO:	***** Run HMMER on gene sequences *****
2025-10-30 10:17:35 INFO:	Running 1343 job(s) on hmmsearch, starting at 10/30/2025 10:17:35
2025-10-30 10:17:38 INFO:	[hmmsearch]	135 of 1343 task(s) completed
2025-10-30 10:17:38 INFO:	[hmmsearch]	269 of 1343 task(s) completed
2025-10-30 10:17:39 INFO:	[hmmsearch]	403 of 1343 task(s) completed
2025-10-30 10:17:40 INFO:	[hmmsearch]	538 of 1343 task(s) completed
2025-10-30 10:17:41 INFO:	[hmmsearch]	672 of 1343 task(s) completed
2025-10-30 10:17:41 INFO:	[hmmsearch]	806 of 1343 task(s) completed
2025-10-30 10:17:42 INFO:	[hmmsearch]	941 of 1343 task(s) completed
2025-10-30 10:17:43 INFO:	[hmmsearch]	1075 of 1343 task(s) completed
2025-10-30 10:17:44 INFO:	[hmmsearch]	1209 of 1343 task(s) completed
2025-10-30 10:17:45 INFO:	[hmmsearch]	1343 of 1343 task(s) completed
2025-10-30 10:17:47 INFO:	Results:	C:98.2%[S:98.1%,D:0.1%],F:0.5%,M:1.3%,n:1343	   

2025-10-30 10:17:47 INFO:	

    ---------------------------------------------------
    |Results from dataset rhodobacter_odb12            |
    ---------------------------------------------------
    |C:98.2%[S:98.1%,D:0.1%],F:0.5%,M:1.3%,n:1343      |
    |1319    Complete BUSCOs (C)                       |
    |1317    Complete and single-copy BUSCOs (S)       |
    |2    Complete and duplicated BUSCOs (D)           |
    |7    Fragmented BUSCOs (F)                        |
    |17    Missing BUSCOs (M)                          |
    |1343    Total BUSCO groups searched               |
    ---------------------------------------------------
2025-10-30 10:17:47 INFO:	BUSCO analysis done. Total running time: 23 seconds
2025-10-30 10:17:47 INFO:	Results written in /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/SUBH2
2025-10-30 10:17:47 INFO:	For assistance with interpreting the results, please consult the userguide: https://busco.ezlab.org/busco_userguide.html

2025-10-30 10:17:47 INFO:	Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
2025-10-30 10:17:47 INFO:	Thank you for using BUSCO! Anonymous usage data is gathered to improve the tool. You may opt out with --opt-out-run-stats.
mkdir: cannot create directory ‘busco’: File exists
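
To compare the three runs side by side, the one-line score string printed in each log (for example `C:98.2%[S:98.0%,D:0.2%],F:0.5%,M:1.3%,n:1343`) can be split into its numeric fields. This is a small sketch with a hypothetical function name, `parse_busco_score`. It relies only on the score format visible in the output above.

```shell
# Hypothetical helper: split a BUSCO score string into tab-separated numbers.
# Output order: C, S, D, F, M (all in percent), then n (number of BUSCO groups).
parse_busco_score() {
    printf '%s\n' "$1" |
        tr -c '0-9.\n' ' ' |                                  # keep digits and dots only
        awk -v OFS='\t' '{ print $1, $2, $3, $4, $5, $6 }'
}

# Usage with the score reported for DSM158 above:
# parse_busco_score 'C:98.2%[S:98.0%,D:0.2%],F:0.5%,M:1.3%,n:1343'
```

The function expects the bare score string, without the timestamp and log level that precede it in the BUSCO log.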

CheckM2 (completeness & contamination)

In [5]:
conda activate checkm2
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes
#mkdir CheckM2
checkm2 predict --threads 30 --force --input ./DSM158.fasta ./GCF_000012905.2_ASM1290v2_genomic.fasta ./GCF_049434525.1_MWCSPHH2ANNA_genomic.fasta --output-directory ./CheckM2
[10/30/2025 10:17:54 AM] INFO: Running CheckM2 version 1.1.0
[10/30/2025 10:17:54 AM] INFO: Running quality prediction workflow with 30 threads.
[10/30/2025 10:17:55 AM] INFO: Calling genes in 3 bins with 30 threads:
    Finished processing 3 of 3 (100.00%) bins.
[10/30/2025 10:18:31 AM] INFO: Calculating metadata for 3 bins with 30 threads:
    Finished processing 3 of 3 (100.00%) bin metadata.
[10/30/2025 10:18:31 AM] INFO: Annotating input genomes with DIAMOND using 30 threads
[10/30/2025 10:19:09 AM] INFO: Processing DIAMOND output
[10/30/2025 10:19:09 AM] INFO: Predicting completeness and contamination using ML models.
[10/30/2025 10:19:14 AM] INFO: Parsing all results and constructing final output table.
[10/30/2025 10:19:14 AM] INFO: CheckM2 finished successfully.
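
CheckM2 does not print the estimates to the log; they are written to a table in the output directory. The sketch below extracts the key columns by header name. It assumes that the table is `./CheckM2/quality_report.tsv` and that it contains columns named `Name`, `Completeness` and `Contamination`. Both the file name and the column names should be checked against the actual output. The function name `checkm2_summary` is hypothetical.

```shell
# Hypothetical helper: print name, completeness and contamination from a CheckM2 report.
# Assumption: tab-separated file with a header containing these three column names.
checkm2_summary() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            # Map header names to column indices so the column order does not matter
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
        }
        { print $(col["Name"]), $(col["Completeness"]), $(col["Contamination"]) }
    ' "$1"
}

# Usage (assumed report location):
# checkm2_summary ./CheckM2/quality_report.tsv
```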

Pangenome construction using PPanGGOLiN

Prepare the genomes.gbff.list file for PPanGGOLiN

In [7]:
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/Ppanggolin/
rm -f genomes.gbff.list # start from a clean list; -f avoids an error if the file is missing
touch genomes.gbff.list
cd ../genomes/Annot
for i in *.gbff *.gbk; do printf '%s\t%s\n' "$i" "$(pwd)/$i" >> ../../Ppanggolin/genomes.gbff.list; done
cd ../../Ppanggolin
cat --show-tabs genomes.gbff.list
GCF_000012905.2.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000012905.2.gbff
GCF_000015985.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000015985.1.gbff
GCF_000021005.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000021005.1.gbff
GCF_000212605.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000212605.1.gbff
GCF_000269625.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000269625.1.gbff
GCF_000273405.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_000273405.1.gbff
GCF_001576595.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_001576595.1.gbff
GCF_001685625.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_001685625.1.gbff
GCF_002706325.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_002706325.1.gbff
GCF_003324715.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_003324715.1.gbff
GCF_003846365.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_003846365.1.gbff
GCF_003846385.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_003846385.1.gbff
GCF_003846405.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_003846405.1.gbff
GCF_003846425.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_003846425.1.gbff
GCF_012647365.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_012647365.1.gbff
GCF_049434525.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_049434525.1.gbff
GCF_052246835.1.gbff^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/GCF_052246835.1.gbff
DSM158.gbk^I/home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/genomes/Annot/DSM158.gbk
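
Before starting the workflow, it can be worth verifying that every line of the list has a name and a path, and that each annotation file exists. The following sketch is an optional check, not part of the original notebook. The function name `check_anno_list` is hypothetical.

```shell
# Hypothetical helper: validate a tab-separated annotation list (name<TAB>path).
# Returns non-zero if a line is incomplete or a listed file does not exist.
check_anno_list() {
    tab=$(printf '\t')
    status=0
    while IFS="$tab" read -r name path rest; do
        if [ -z "$name" ] || [ -z "$path" ]; then
            echo "malformed line: $name" >&2
            status=1
        elif [ ! -f "$path" ]; then
            echo "missing file: $path" >&2
            status=1
        fi
    done < "$1"
    return "$status"
}

# Usage (file name taken from the cell above):
# check_anno_list genomes.gbff.list && echo "list OK"
```

Any additional columns on a line are read into `rest` and ignored by this check.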

Calculate the pangenome

In [9]:
cd /home/Drives/HDD03_06T_SDE/anna/SyntenyPlotDSMSubH2/Ppanggolin/
conda activate ppanggolin
ppanggolin workflow -f -o ./output --cpu 20 --anno genomes.gbff.list
conda deactivate
2025-10-30 10:19:52 main.py:l146 INFO	Command: /home/jupyter-anna/.conda/envs/ppanggolin/bin/ppanggolin workflow -f -o ./output --cpu 20 --anno genomes.gbff.list
2025-10-30 10:19:52 main.py:l147 INFO	PPanGGOLiN version: 1.0.13
2025-10-30 10:19:52 annotate.py:l309 INFO	Reading genomes.gbff.list the list of organism files ...
Processing DSM158.gbk: 100%|███████| 18/18 [00:09<00:00,  1.95annotation file/s]
2025-10-30 10:20:01 writeBinaries.py:l387 INFO	Writing genome annotations...
100%|███████████████████████████████████████| 18/18 [00:00<00:00, 92.12genome/s]
2025-10-30 10:20:01 writeBinaries.py:l400 INFO	writing the protein coding gene dna sequences
100%|███████████████████████████████| 77147/77147 [00:00<00:00, 105161.82gene/s]
2025-10-30 10:20:02 writeBinaries.py:l426 INFO	Done writing the pangenome. It is in file : output/pangenome.h5
2025-10-30 10:20:02 cluster.py:l158 INFO	Writing all of the CDS sequences for clustering...
100%|███████████████████████████████| 77147/77147 [00:00<00:00, 323343.87gene/s]
2025-10-30 10:20:02 cluster.py:l201 INFO	Clustering all of the genes sequences...
2025-10-30 10:20:02 cluster.py:l45 INFO	Creating sequence database...
2025-10-30 10:20:03 cluster.py:l54 INFO	Clustering sequences...
2025-10-30 10:20:05 cluster.py:l56 INFO	Extracting cluster representatives...
2025-10-30 10:20:05 cluster.py:l68 INFO	Writing gene to family informations
2025-10-30 10:20:06 cluster.py:l148 INFO	Adding protein sequences to the gene families
2025-10-30 10:20:06 cluster.py:l130 INFO	Adding 77147 genes to the gene families
100%|███████████████████████████████| 77147/77147 [00:00<00:00, 230322.64gene/s]
2025-10-30 10:20:06 makeGraph.py:l56 INFO	Computing the neighbors graph...
Processing DSM158.gbk: 100%|██████████████| 18/18 [00:00<00:00, 42.71organism/s]
2025-10-30 10:20:07 makeGraph.py:l74 INFO	Done making the neighbors graph.
2025-10-30 10:20:07 partition.py:l349 INFO	Estimating the optimal number of partitions...
100%|███████████████| 19/19 [00:01<00:00, 11.59Number of number of partitions/s]
2025-10-30 10:20:08 partition.py:l351 INFO	The number of partitions has been evaluated at 3
2025-10-30 10:20:08 partition.py:l369 INFO	Partitioning...
2025-10-30 10:20:09 partition.py:l429 INFO	Partitionned 18 genomes in 0.31 seconds.
2025-10-30 10:20:09 writeBinaries.py:l405 INFO	Writing gene families and gene associations...
100%|██████████████████████████| 6224/6224 [00:00<00:00, 107409.53gene family/s]
2025-10-30 10:20:09 writeBinaries.py:l407 INFO	Writing gene families information...
100%|██████████████████████████| 6224/6224 [00:00<00:00, 312541.58gene family/s]
2025-10-30 10:20:09 writeBinaries.py:l414 INFO	Writing the edges...
100%|██████████████████████████████████| 7398/7398 [00:00<00:00, 97064.43edge/s]
2025-10-30 10:20:09 writeBinaries.py:l328 INFO	Updating gene families with partition information
100%|██████████████████████████| 6224/6224 [00:00<00:00, 195830.26gene family/s]
2025-10-30 10:20:09 writeBinaries.py:l426 INFO	Done writing the pangenome. It is in file : output/pangenome.h5
2025-10-30 10:20:09 tile_plot.py:l38 INFO	Drawing the tile plot...
2025-10-30 10:20:09 tile_plot.py:l54 INFO	start with matrice
2025-10-30 10:20:09 tile_plot.py:l69 INFO	done with making the dendrogram to order the organisms on the plot
2025-10-30 10:20:09 tile_plot.py:l104 INFO	Getting the gene name(s) and the number for each tile of the plot ...
2025-10-30 10:20:09 tile_plot.py:l113 INFO	Done extracting names and numbers. Making the heatmap ...
2025-10-30 10:20:10 tile_plot.py:l169 INFO	Drawing the figure itself...
2025-10-30 10:20:12 tile_plot.py:l171 INFO	Done with the tile plot : './output/tile_plot.html' 
2025-10-30 10:20:12 ucurve.py:l13 INFO	Drawing the U-shaped curve...
2025-10-30 10:20:13 ucurve.py:l60 INFO	Done drawing the U-shaped curve : './output/Ushaped_plot.html'
2025-10-30 10:20:13 writeFlat.py:l225 INFO	Writing the .csv file ...
2025-10-30 10:20:13 writeFlat.py:l281 INFO	Writing the gene presence absence file ...
2025-10-30 10:20:13 writeFlat.py:l213 INFO	Writing the gexf file for the pangenome graph...
2025-10-30 10:20:13 writeFlat.py:l213 INFO	Writing the light gexf file for the pangenome graph...
2025-10-30 10:20:13 writeFlat.py:l421 INFO	Writing the projection files...
2025-10-30 10:20:13 writeFlat.py:l304 INFO	Writing pangenome statistics...
2025-10-30 10:20:13 writeFlat.py:l305 INFO	Writing statistics on persistent duplication...
2025-10-30 10:20:13 writeFlat.py:l106 INFO	Writing the json file for the pangenome graph...
2025-10-30 10:20:13 writeFlat.py:l430 INFO	Writing the list of gene families for each partitions...
2025-10-30 10:20:13 writeFlat.py:l301 INFO	Done writing the gene presence absence file : './output/gene_presence_absence.Rtab'
2025-10-30 10:20:13 writeFlat.py:l456 INFO	Done writing the list of gene families for each partition
2025-10-30 10:20:13 writeFlat.py:l326 INFO	Done writing stats on persistent duplication
2025-10-30 10:20:13 writeFlat.py:l327 INFO	Writing genome per genome statistics (completeness and counts)...
2025-10-30 10:20:13 writeFlat.py:l389 INFO	Done writing genome per genome statistics
2025-10-30 10:20:13 writeFlat.py:l278 INFO	Done writing the matrix : './output/matrix.csv'
2025-10-30 10:20:13 writeFlat.py:l222 INFO	Done writing the gexf file : './output/pangenomeGraph_light.gexf'
2025-10-30 10:20:13 writeFlat.py:l222 INFO	Done writing the gexf file : './output/pangenomeGraph.gexf'
2025-10-30 10:20:13 writeFlat.py:l427 INFO	Done writing the projection files
2025-10-30 10:20:14 writeFlat.py:l113 INFO	Done writing the json file : './output/pangenomeGraph.json'
Genes : 77147
Organisms : 18
Families : 6224
Edges : 7398
Persistent ( min:0.67, max:1.0, sd:0.04, mean:0.99 ): 3708
Shell ( min:0.33, max:0.83, sd:0.1, mean:0.59 ): 538
Cloud ( min:0.06, max:0.61, sd:0.1, mean:0.12 ): 1978
Number of partitions : 3
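
The three partition sizes add up to the number of gene families (3708 + 538 + 1978 = 6224). Their relative shares can be derived with a short awk call. The function below is an illustrative sketch with a hypothetical name, `partition_shares`.

```shell
# Hypothetical helper: relative share of each pangenome partition.
# Arguments: number of persistent, shell and cloud gene families.
partition_shares() {
    awk -v p="$1" -v s="$2" -v c="$3" 'BEGIN {
        total = p + s + c
        printf "persistent\t%.1f\n", 100 * p / total
        printf "shell\t%.1f\n", 100 * s / total
        printf "cloud\t%.1f\n", 100 * c / total
        printf "total\t%d\n", total
    }'
}

# Usage with the counts reported above:
# partition_shares 3708 538 1978
```

For the counts above, this gives about 59.6 % persistent, 8.6 % shell and 31.8 % cloud families.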