13:06:54 So I'm speaking to you from Europe, so it's good night here, and I'm happy to be here; I participated in some of the talks last week and, of 13:06:59 course, some of the speakers I know well. And I'll talk about some statistical approaches, but ones which are linked to the problems we solve for microbial communities. 13:07:14 I've been working in this field for a while; when we first started working on the microbiome we were doing Sanger sequencing, just to say how long ago that was, about 15 years. 13:07:23 And there are many multivariate statistical problems that come up, and I'm going to talk about something slightly simpler. 13:07:30 And that's the problem of finding biomarkers, and I'll explain a little bit what my work is motivated by, and my different approaches. But don't hesitate, if I say things which you don't understand because I'm using statistical jargon. 13:07:51 Just, you know, you can leap in and ask questions. I won't see you because I've shut off all the video except the slides, but you can still intervene. 13:08:01 If I hear somebody, I'll stop. 13:08:05 Is that OK? 13:08:07 Alright. So, the problem I'm going to talk about has to do with a new type of data that we've been generating. 13:08:15 And this new type of data was motivated by the fact that we wanted to work at higher resolution. 13:08:22 So we want to develop statistical methods which take into account whole communities and transitions between states of the communities. 13:08:35 We work on cases where we have groups of patients or subjects which are different, and we want to look at the differences in the microbial communities. The statistical models that we have to use are infinite dimensional, and I'll explain what I mean 13:08:53 by that, and why that occurs. I'm then going to develop some hierarchical models that work well, and show how we can use these hierarchical statistical models to generate in silico data for microbial communities, where we can test and evaluate 13:09:17 different procedures. And in particular, I'll show you a problem that comes up for us, which we call the strain switching problem, and where we have a quest for improvement in our statistical power. 13:09:30 So for the human microbiome I work with David Relman and his lab, and that work, funded by an NIH formative research grant, has to do with perturbations and resilience: antibiotic perturbations mostly, but we also do colonic clean-out and diet perturbations. 13:09:58 And the other set of studies we work on is with the Gates Foundation, in particular the Vaginal Microbiome Research Consortium, where we are studying pregnancy, and preterm pregnancy in particular. 13:10:08 About the data we work with, from a statistical viewpoint it is really important to know that the features are not predefined. We don't know what the taxa are going to be in the given samples. 13:10:22 We have the problem of unequal depth in the different samples, and Ben Callahan talked a little bit about this last week, but this leads to differences in the variances of the estimators. 13:10:35 And we have what we call the denominator problem: because the denominators aren't the same, the normalized ratios that people sometimes use are heteroscedastic, which means that they have these different variances which we need to take into 13:10:51 account. Why do we care about variances?
Well, as many of you have a background in physics, you will know that uncertainty quantification and being able to give standard errors are really important when you're trying to test hypotheses. 13:11:07 So the uncertainty quantification is an important part of this. 13:11:11 So the tools that we use to try and analyze the data are what statisticians call mixtures. 13:11:20 We don't have one parametric population like a normal distribution or a simple Poisson. We have mixtures of distributions with different parameters, and the mixing parameter, say the lambda in the case of the Poisson, itself follows 13:11:40 a distribution. And that gives an underlying gradient, a distribution of distributions, in some sense. There's actually a theoretical reason; some physicists in quantum physics now know more about de Finetti than we do, but de Finetti's theorem says that in the 13:11:59 case of this type of data, which are exchangeable, the data have to be mixtures. So it's not just that the mixture fits well; it's a theoretical fact. We have latent variables, or factors, which are very useful from an interpretation viewpoint 13:12:17 when we're trying to understand what is changing the dynamics or the evolution of the populations. 13:12:24 We have hidden variables, things that we haven't measured, and these variables might be pH or altitude or depth in the case of the ocean, but they're latent variables. As statisticians who work with a lot of programming, we don't stress about the choices 13:12:43 we make at the beginning, because they are not forever. We make reproducible research workflows. 13:12:50 And we have to think only about what information is discarded, because as long as something is retrievable you can, you know, change a tuning parameter and go backwards, provided you have kept all the sufficient information. 13:13:04 So the first hurdle is that we have, in some sense, an infinite number of taxa, or boxes, or categories, or ASVs. And where does that come from? 13:13:16 It comes from the work that we did with Ben Callahan in the development of DADA2, which gives high resolution inference from reads. So suppose you have 16S data, 13:13:33 and these are the reads, which are supposed to represent the different sequences that you have at the top. We have a probabilistic model which allows you to decide, given the frequencies of the different reads, which are real sequences and which are errors. 13:13:50 And so we denoise these data at high resolution, and we might end up with, say, in this case, seven ASVs plus one singleton; we have a tendency to drop singletons, and I can talk about that later, but it's not very important. 13:14:09 So we actually end up with what we call seven amplicon sequence variants, or ASVs, and the data themselves become the counts of these amplicon variants in the different samples. So here are the data with the seven ASVs, and in the first sample, suppose 13:14:32 that we had, you know, 81 reads from ASV one. 13:14:37 So you see that through this process, the number of features, first of all, wasn't predefined or pre-specified, and it has a tendency to be much, much larger than the number of samples. That has consequences for the statistical model; in particular, statisticians 13:14:57 say that we're going to have a nonparametric model. And that just means that we haven't predefined the dimensionality. In general, with p much larger than n, we say we have to use nonparametric methods.
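As a rough illustration of that decision rule, here is a toy version of the abundance p-value idea, assuming a single per-read error probability; the real DADA2 error model is quality-aware and fit from the data itself, and all the numbers below are made up.

```python
# Toy illustration of the abundance p-value idea behind DADA2-style denoising.
# Assumption: errors from an abundant "parent" sequence produce reads of a
# candidate sequence at a fixed per-read rate, so the error-read count is
# roughly Poisson. A tiny tail probability suggests a real variant.
from scipy import stats

def abundance_pvalue(candidate_count, parent_count, error_prob):
    """P(seeing >= candidate_count reads purely from errors of the parent)."""
    lam = parent_count * error_prob
    # survival function at k-1 gives P(X >= k)
    return stats.poisson.sf(candidate_count - 1, lam)

# a sequence seen 50 times, near a parent seen 10,000 times
print(abundance_pvalue(50, 10_000, 1e-4))   # tiny -> likely a real ASV
print(abundance_pvalue(2, 10_000, 1e-4))    # large -> consistent with errors
```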
13:15:11 Now, it's important, and this is a mistake that I see a lot of people making in analyzing microbiome data, to separate the model from the actual observed data. The data themselves, what we collect, are the reads, and we use those reads to 13:15:32 estimate parameters like, for instance, the prevalences of the different ASVs that we believe were in the original sample. 13:15:44 But there are of course going to be biases and problems in the relationship between the data and the truth, and we have ways of estimating that true prevalence. 13:16:01 And then from the best estimate we have, we can make a probabilistic model, from which we generate probability distributions which are used for hypothesis testing or for uncertainty quantification. 13:16:15 Now, I'm not going to develop this very much further here; with Wolfgang Huber we wrote a whole book on the subject. It's called Modern Statistics for Modern Biology and it's available online. 13:16:25 And so if you're interested in that perspective, 13:16:30 you know, I recommend you take a look. 13:16:33 Now, these ASV tables, in statistics speak we call them contingency tables, and the data count data. So they have properties which make them not normal or Gaussian. 13:16:49 They're not normally distributed. 13:16:53 The sample 13:16:53 sums here, the columns in the contingency tables, are 13:16:59 random variables. That is, the success with which we were able to create reads in each of the samples is a random process. And we'll see that this randomness in the column sums increases, in some sense, the uncertainty with which we know 13:17:19 the true values. 13:17:21 So, the data, as I said before, that X matrix which I call the data, are counts. The data themselves are not compositional. A lot of people talk about compositionality; compositionality applies to the parameters, that is, to the prevalences, the 13:17:40 proportions, which sum to one, the relative abundances. But the data themselves are counts. 13:17:50 It's a big mistake to transform the data immediately into ratios and forget what the denominators were, because when you come down the road to doing testing, you don't know what your standard errors are anymore, so you can't test, 13:18:05 after perturbations for instance. Compositional data analysis was actually first used in chemistry and soil science, where you took one gram of soil and you said ahead of time: 13:18:18 I'm going to look at these 15 elements and I'm going to look at the proportions of the elements in this one gram. The number of categories is fixed ahead of time, and every whole, this one gram, is the same whole. 13:18:32 In the case of microbial communities, the amount of bacteria present in samples is going to change under perturbation. For instance, I work on antibiotics, 13:18:42 and so we have much less 13:18:46 bacteria present after the antibiotic has taken effect. So that's something which is important to take into account. And so the whole, the "one", that people assume in compositional data is actually not the same from sample to sample. 13:19:15 If you're trying to estimate, say, bias or read numbers, then you need all the counts. There's a question. Yes, sorry, go ahead.
13:19:35 For the purposes of this talk you're talking about amplicon-type counts here, using this as an example. So, for the purposes of this talk, should we think about metagenomics 13:19:37 as well, or is it mostly amplicon? There's nothing that I say which doesn't apply to metagenomics, but I'm going to do all my examples on 16S microbiome data. 13:19:51 A metagenomic table is a contingency table too; the complexities in the underlying sources of variability in the contingency table in the metagenome are orders of magnitude bigger, because you have the sizes of the genes, the numbers of repeats of 13:20:12 the genes, all kinds of extra factors which come in, which I won't have time to treat. And I could point you to literature which does that pretty well, in particular by Christopher Quince. But I'm going to talk about 16S, so yep. 13:20:28 Got it. Thanks. 13:20:31 Okay. 13:20:31 Other questions? 13:20:34 All right. 13:20:36 So, one of the problems we have is that these denominators being different means that you have to make transformations of the data to equalize the variance. Where the counts are large, the denominator is large, 13:20:55 so you have higher precision, and we will want transformations which take that into account. There are actually good ones that we can justify distributionally, 13:21:08 looking at the count data. 13:21:10 So the ASV features are built from the raw data, and this is true for many types of omics data we work with, that is, single-cell and other genomic data: the categories are built from the data themselves, they're data generated. 13:21:26 So it's important to keep all the pre-processing information. And just to show you an example where you see the difference in the sample sizes, maybe for some of you who haven't worked a lot with microbiome data: 13:21:38 you can see this sample had 864,000 reads, and this one has more than a million. And so there can be changes of an order of magnitude in the numbers of reads. 13:21:54 To start off with, and I just mentioned Christopher Quince, he's one of the people who defined one of the first models, which is a balls-in-boxes multinomial model. 13:22:06 And we're going to have to enrich that balls-in-boxes model, but we can start off with it. 13:22:11 So suppose here the number of boxes is finite; it's just the number of ASVs, and the number of reads is random 13:22:23 in each of the samples. So you have a random number of balls dropped into the boxes at random, and what you want to do is look at different conditions. So you might have a treatment sample and a control sample, and you want 13:22:40 to look at the difference in prevalences, proportions, so the difference in those factors. And in particular, what we're often after is to try and find one or two 13:22:56 ASVs, or biomarkers, which are the most different between the two groups or the two states; a minimal simulation of this starting model follows below. 13:23:05 And I'm going to talk about a problem which comes up. We did a lot of work on trying to find how to test the differences and find the species which were the most different, and we came across a problem, the strain switching problem, 13:23:20 which comes up when you do high resolution denoising and obtain ASVs: then you have strains which switch between samples.
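As a minimal sketch of that balls-in-boxes starting model, the following simulates a random number of reads falling into a fixed set of ASVs, and compares one ASV's proportion across two groups; all the probabilities and depths are invented for illustration.

```python
# Balls-in-boxes (multinomial) sketch: random library sizes, fixed boxes,
# then a rank test on one ASV's proportion across two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_control = np.array([0.40, 0.30, 0.20, 0.10])   # 4 "boxes" (ASVs)
p_treat   = np.array([0.25, 0.30, 0.20, 0.25])   # ASV1 shifted down, ASV4 up

def sample_group(p, n_samples):
    depths = rng.poisson(50_000, size=n_samples)  # random number of balls
    return np.vstack([rng.multinomial(d, p) for d in depths])

ctrl, trt = sample_group(p_control, 8), sample_group(p_treat, 8)
prop_ctrl = ctrl[:, 0] / ctrl.sum(axis=1)
prop_trt  = trt[:, 0] / trt.sum(axis=1)
print(stats.mannwhitneyu(prop_ctrl, prop_trt))    # is ASV1's proportion different?
```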
13:23:35 First of all, what I usually suggest as a first approximation, when working with these problems of depth, is that nothing stops you from just taking all the ASV counts in a sample and ranking them from the most prevalent to the least prevalent. 13:23:56 Sorry, a question: I didn't quite understand what you meant by strain switching, could you maybe explain that in greater detail? 13:24:04 Um, I am going to come back to it, but it's very important, so I'll say it now and I'll say it again. The strain switching problem is when you get amplicon sequence variants: DADA2 is very, very sensitive to small changes in the sequences, 13:24:36 and you have two strains. In one set of samples you might have one strain, and we have examples, for instance, in the case of Gardnerella in the vaginal microbiome, and in another set of samples you have a very, very similar strain. 13:24:52 Now, in these samples, if the two were considered exactly the same box, you might find that that Gardnerella was a significant 13:25:03 predictor of, say, preterm birth, or of your groupings, right? But because it's split between two ASVs, it decreases the power, and I'll show you on a simulation experiment 13:25:19 what I mean by that. We have this also for the Lactobacillus and others: two very similar strains. And I think the parallel which works well, and which I'm going to use later on, is that you have strains which are synonyms. 13:25:36 That is, they actually have the same function, but they don't have exactly the same spelling. 13:25:44 And in your boxes model you're treating them as if they were different, and that significantly decreases the power with which you can see significant differences. 13:25:57 Does that make sense? 13:26:00 Yes, I think, but we can return to it, because there's got to be a trade-off. Yeah, yeah. Well, we'll come back; that's what this is all about. 13:26:10 So that's one of the things that's going to come up. The first thing I was just going to say, because it's a very simple method that doesn't care about the normalization that you do, and I always recommend starting the analysis 13:26:24 with it, is to rank the ASVs, 13:26:31 from the most prevalent to the least prevalent. So here I'm going to give the most prevalent a score of five, 13:26:41 and the second most prevalent a score of four. And then here the third one a score of three, and then 2, 1, 1 and 1. And so you could ask: why did I create ties 13:26:57 at the bottom of the ranking? This is called a rank threshold transformation. 13:27:03 And the reason is that at noise level, you actually want them all to be tied. Now here of course I did a baby example with seven ASVs; in general we have something like 2,000, or sometimes 20,000 ASVs. 13:27:22 And if, say, you had 2,000, and actually 1,500 of them are at noise level, 13:27:31 then there are 500 which are really there, that you could rank. But if you're looking at the change in ranks, you could have things at noise level which jump a lot and make you think that something is significant whereas it's not meaningful. 13:27:48 So we create ties at the bottom. If I take the example of 2,000, and I think that out of my 2,000 I only have about 500 that are really there, 13:27:59 I give ranks from 500, 499, down to one, and the other 1,500 I give them all ones.
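A minimal sketch of that rank-threshold transformation, written out in code; the function name and the example counts (which echo the numbers on the slide) are just for illustration.

```python
# Rank-threshold transformation: rank ASVs within one sample from most to
# least abundant, and tie everything below a presumed richness at rank 1.
import numpy as np

def rank_threshold(counts, n_real):
    """counts: 1-D array of ASV counts for ONE sample.
    n_real: rough prior guess of how many taxa are truly present."""
    p = len(counts)
    order = np.argsort(-counts)              # most abundant first
    ranks = np.ones(p)                       # everything starts tied at 1
    top = order[:n_real]
    ranks[top] = np.arange(n_real, 0, -1)    # n_real, n_real-1, ..., 1
    return ranks

sample = np.array([833, 0, 812, 13, 328, 0, 332])
print(rank_threshold(sample, n_real=5))      # ties at the noisy bottom
```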
13:28:07 And this rank threshold method is robust, that is, it's monotone: it's equivalent to doing log ratios and then ranking, or any other monotone transformation, except for these noise-level 13:28:24 ties, which are very useful for not getting false positives. So the rank threshold method is the first way of going about it; it loses some information, but it has been shown to be very robust. 13:28:39 And we did this in the workflow paper. 13:28:43 We use it for multivariate analyses and various other kinds of analyses. It's pretty simple, and you don't have to make any hypotheses about the distribution of the reads or the biases, for instance; you're only ranking within each sample. 13:29:02 Can I ask a question? Yes. 13:29:04 On your previous slide, 13:29:05 in terms of collapsing things within noise, shouldn't you have also put 328 and 332 in the same bin, like, within some sort of square root of a count? 13:29:17 Well, we could discuss that, but that was my rule. 13:29:22 And let me explain why a difference of one 13:29:27 won't make any difference when you're looking at the changes. What we're trying to see is, if we have two sets of samples, two groups of samples, and the rankings are very different in the two sets of samples for the different ASVs, then 13:29:45 we can say that the two groups are very different. So you're looking for big jumps in the ranks of ASVs across the two sets of samples; a difference of one is not going to make a big difference. 13:29:59 The big difference that could occur: if your noise level covers the difference between the 600th and the 1,600th, you could have a jump of 1,000 which looks very significant, while actually everything is within noise. 13:30:18 So you could want to do that, but it wouldn't have a big effect; it's at the noise level that it matters. Thanks. 13:30:27 Okay, excuse me, one more quick question. Um, I really liked this idea of putting ties on things that are at noise level. 13:30:35 How would you recommend that we go about estimating where that threshold is? You know, I like to work with the biologists, so usually I point my water pistol at them and I say: how many species do you think there are in the samples? 13:30:53 And, you know, for instance, we were working on the gut, and we were looking at different papers which had appeared trying to establish how many different bacteria you could expect. You had lists of, say, 13:31:09 10,000, and they were saying, well, we think there should be about 1,500 in there. So I put the threshold there; you can be a little bit lax, say 1,800 or 2,000, but you know that it's not more than that that are going to be there. 13:31:27 And this is the beauty of doing reproducible research: that's what I call a tuning parameter. It depends a little bit on prior knowledge, but it's very robust, that is, you can go back and do a sensitivity analysis on 13:31:49 it and change it, and it won't change a lot; a sketch of such a sensitivity check follows below. But, of course, the more you know, the better you do. I mean, if you have good prior knowledge about your environment, 13:32:03 then that's important. I think that, for instance, the marine environment is extremely rich, so you'd expect a much bigger number.
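A hedged sketch of that sensitivity analysis: vary the tie threshold and check how stable the resulting rankings are. It reuses the rank_threshold function from the sketch above, and the fake counts are invented.

```python
# Sensitivity of the rank-threshold transform to the tie threshold:
# compare two choices via the Spearman correlation of the resulting ranks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.negative_binomial(0.5, 0.01, size=2000)  # fake sparse ASV counts

r_a = rank_threshold(counts, n_real=1500)  # rank_threshold defined earlier
r_b = rank_threshold(counts, n_real=1800)
print(stats.spearmanr(r_a, r_b))           # high correlation -> robust choice
```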
13:32:10 Since you're advocating basing it on the number of features, rather than something about the number of reads in the features that are noise? Yes, because for the number of reads, well, there are all these technological biases about how your samples got, 13:32:22 you know, the amplification, all of that comes in. So I would say, yes, it's the number of ASVs, roughly, and it only has to be very rough. 13:32:34 That's right. Thank you. Right. 13:32:38 Okay, so I've done that several times, and we always do it as a first pass on our data, and we find that if there are any low hanging fruit, 13:32:49 this is the best way to find them, because it's what we call nonparametric, and it's completely robust. So if there's a difference on the threshold ranks 13:33:00 there, then you're done. You don't have to, you know, revisit or anything, so that's a good way of going about it. And then we did more multivariate analyses on these ranks in this paper. 13:33:12 Now I want to give you a few motivating examples about testing between two groups and talk a little bit about some of the work that we've done. 13:33:25 There are two sets of papers. One is about the antibiotic perturbations. In this case, we were looking at three patients; this is the first study that was done with David Relman. 13:33:43 And we had about 50 time points on each of three patients, and we wanted to look at the effect of taking antibiotics. 13:33:53 And we had the different species. The first step in this analysis is that we create the phylogenetic tree of the various ASVs. This gives us a distance matrix: we look at the distances on the tree. 13:34:08 We take the distance matrix, and we do multidimensional scaling, or PCoA as it's sometimes called, in which you try to find a Euclidean embedding, low dimensional, usually two dimensional. 13:34:24 And then we can put down the different species points here. 13:34:29 And then the samples are going to be combinations, barycenters or centers of inertia, of the species, with weights equal to the frequency of each of the species in the sample. That gives you weighted points, and then you redo a PCA on them, 13:34:52 and the output is something like this, which is shown here. This is in the normal state, and this is in the antibiotic stress state. So here you have the patient 13:35:05 E and the patient D; their changes are almost parallel in this multidimensional space, and the patient F practically didn't change. 13:35:17 We worked with Relman later on to try and find out which are those patients: can you find markers for resilience? I'm not going to explain in too much detail how multidimensional scaling works; a minimal sketch follows below. 13:35:36 I wanted the slides to be standalone, so if you're interested in the technique which allows you to integrate the tree as a weighting, this double principal coordinates analysis, these slides explain that. 13:35:50 And as I said, this is the description of the data, the two categories of the data: the time points where there was no Cipro, or in between the Cipro courses, 13:36:06 and then those where there was antibiotic stress, just after taking Cipro and during the Cipro. 13:36:15 And here we have the first multidimensional scaling I was describing, of the taxa, and then these are the three subjects, all of their points as they appear. 13:36:28 And I'm going to come back to this later on.
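Before moving on, here is a minimal sketch of classical multidimensional scaling (principal coordinates) from a distance matrix, with the sample-as-barycenter step illustrated on a toy example; the distances and abundances are invented, and the full DPCoA weighting is richer than this.

```python
# Classical MDS (PCoA): embed points from a distance matrix (e.g. patristic
# distances on the phylogenetic tree), then place a sample at the
# abundance-weighted barycenter of its species points.
import numpy as np

def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # doubly centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]           # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# toy tree distances among 4 species
D = np.array([[0, 2, 5, 6],
              [2, 0, 5, 6],
              [5, 5, 0, 3],
              [6, 6, 3, 0]], float)
species_xy = classical_mds(D)
abund = np.array([0.5, 0.3, 0.1, 0.1])         # one sample's relative abundances
sample_xy = abund @ species_xy                 # barycenter of species points
print(species_xy, sample_xy, sep="\n")
```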
13:36:33 When we see something like this, or the one I showed you before, this picture down here, 13:36:41 we then want to understand what this direction of change between states is. And that's where even this multivariate method, which took the tree into account, was very unsatisfactory. 13:36:58 And I'm going to revisit these data with a different method which is much more interpretable. 13:37:03 But that was the motivation. 13:37:06 So these community points here are hard to interpret, and we're going to come back to them. 13:37:14 Here's another study, which is another motivation, again trying to look at the difference between two groups; in the antibiotic study we had the antibiotic and non-antibiotic periods within each subject, and in this case we'll be interested in preterm birth. 13:37:29 We did a study on 49 pregnant women, and we used nine women as a validation set. We had microbiome samples from many body sites, but in this study I'm only going to talk about the vaginal microbiome. 13:37:48 This was a study which appeared in PNAS, and which we replicated with Ben Callahan and David Relman with a different set of women, so it's pretty robust; we're pretty happy with the results. 13:38:06 What we see is that these communities 13:38:11 clearly have types. 13:38:14 Each column here corresponds to a sample, and we have several samples, maybe up to 10 samples, from the same woman. 13:38:25 And so it might be that all the samples down here come from the same woman, and that she delivered preterm. Whether they delivered preterm or not is shown in 13:38:44 pink, and light pink is very preterm. 13:38:48 And then we have the community state types as they're defined from the prevalences or frequencies of the different bacteria. So Lactobacillus crispatus is completely characteristic of what we call community state type one. 13:39:07 And you see all those samples here together, and we have a few preterm births but not a lot. Community state type two is dominated by Lactobacillus gasseri, with a little bit of Gardnerella also present, and group three is dominated by L. iners. 13:39:28 And here you have, for instance, some Lactobacillus iners, and in community state type five some L. jensenii. 13:39:37 Now, the one group which doesn't have one dominant bacterium is group four, and I always like to remind people about this case, because it's the case where we have biodiversity which is pathological. 13:39:55 There is much higher diversity in this community state four, and many, many preterm births. 13:40:05 So not having a Lactobacillus, actually, is associated with preterm birth. And in this case the diversity is the pathology. 13:40:22 Now, within this group there are actual species which associate with the outcome, and we went back, revisited this, and confirmed it with Ben: different strains of Gardnerella vaginalis are associated in some of the communities with preterm birth. 13:40:43 So you can build, in some sense, co-occurrence... Can I ask a question? Yes, go ahead. I don't understand why you would cluster; can I ask about clustering in general? Of course. It's really hard to choose the number of clusters. 13:41:00 So if I had a predictive variable, why wouldn't you bottleneck on that, or why wouldn't you use... 13:41:07 I'm just trying to hear how you chose the number of clusters, I guess, and I think it's an easier problem if you know you want to find some group that's associated with another variable than it is otherwise.
13:41:20 It isn't easy. But we did it in a different way. This is not a supervised method; we didn't try to do a supervised method. But, and again this is a case where I didn't tell you the whole story, I was dishonest: 13:41:33 there's a previous set of papers which say there are five community state types in the vaginal microbiome. 13:41:42 So there was already a belief that there were five types. 13:41:47 And so we just clustered, and the classes are pretty clear. 13:41:57 And the presences and absences are characteristic. So, for instance, in group five, complete absence of Lactobacillus crispatus. 13:42:08 And so you have this exclusivity of Lactobacillus: 13:42:19 say, we don't have any of the L. iners in here. And another exclusion, which is also very characteristic of this group: you see, as soon as you have Lactobacillus you have no Gardnerella. 13:42:36 And this is a very important point, because it's one of the reasons why the standard multinomial model doesn't work for bacteria: they either work together or they work against each other, but there can be exclusions. 13:42:52 In a multinomial model you can't have that; the taxa are independent, once you, you know, grant that the only dependency comes from the number of reads that you put in boxes. So that's one of the reasons, but in this case you see these exclusions very clearly. 13:43:09 But we didn't want to recluster from scratch; 13:43:13 we wanted to use the previous state types. And so we clustered using distances, various types of distances; mostly Bray-Curtis works very well for doing this clustering, and it recreates these five clusters. 13:43:34 And we were interested in trying to understand the transitions between states. 13:43:41 So what you can see, and maybe this is clearer... 13:43:45 The question was, you know, how stable are the communities: can we make a diagnostic at the beginning of pregnancy that this patient is at risk, and how likely are we to be transitioning? 13:44:01 So let me show you this picture, maybe that will make it a bit clearer what we're trying to do. Here we've colored the states: 13:44:12 we have community state three up here, and these are the lengths of the pregnancies. So you see that there are some spots here, and this pregnancy wasn't short, 13:44:25 but they went back and forth between the blue and the pink. But as you get down to the preterm pregnancies you see a lot of the dots are red, even early in the pregnancies. 13:44:42 So early on in the pregnancy you could already have told that, you know, this was going to be a problem. 13:44:50 But we did want to use this carefully. 13:44:53 We're going to revisit this later, because, as I'll explain, we needed a more flexible model to look at the transitions, but we didn't want to force things. 13:45:04 You see, there are other factors which make a woman have a baby preterm, 13:45:10 and we didn't have a hold on all the covariates here which could have created preterm birth, so we didn't want to force the data too much to be predictive of preterm birth, to force it into a model too much. 13:45:26 Does that make sense? 13:45:31 Sure. 13:45:32 Okay. 13:45:36 So we made a Markov model of the transitions between the different states, and we saw that this state four is the most volatile.
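A minimal sketch of that Markov-chain view of community state type (CST) dynamics: estimate a transition matrix from consecutive CST labels, one chain per subject. The label sequences here are invented for illustration.

```python
# Estimate a Markov transition matrix between 5 community state types
# from observed per-subject sequences of CST labels (numbered 1..5).
import numpy as np

def transition_matrix(chains, n_states=5):
    C = np.zeros((n_states, n_states))
    for chain in chains:                        # one CST sequence per subject
        for a, b in zip(chain[:-1], chain[1:]):
            C[a - 1, b - 1] += 1                # count observed transitions
    rows = C.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1                         # avoid dividing by zero
    return C / rows                             # rows become probabilities

chains = [[1, 1, 1, 4, 4, 1], [3, 3, 4, 4, 4, 4], [2, 2, 2, 2, 2, 2]]
P = transition_matrix(chains)
print(P[3])   # transition probabilities out of the volatile CST 4
```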
13:45:44 And we were trying to explain the transitions between states, out of state four in particular, because those are the pathologies. And in the follow up study we've been using something a little bit more flexible than clustering, and that's these topic models. 13:46:03 So we did confirm this: we did the follow up study with Ben Callahan on the temporal and spatial variation afterwards, and we've been following up now with multi-omic data with Laura Symul, and we see that there's a real benefit 13:46:24 to a generalization of clustering. So instead of doing an ordinary clustering, where each sample belongs to one community state type, we're going to do mixed membership and topic models. 13:46:39 But before I get to topic models, let me talk about mixtures in general, because I think it's important to understand the model and the in silico generative models that we use, and where they come from. 13:46:54 So, first of all, we have a theoretical justification of why these data are mixtures, and if you're not a statistician I don't expect you to know this. 13:47:06 There's a very nice review by Diaconis and Freedman, and it says that if the samples, the specimens, are exchangeable in the statistical sense, which just means that we have complete symmetry, 13:47:19 that is, the probability of a certain ordering of the vectors doesn't change, is invariant, so if I have a permutation of these I get the same probability, 13:47:34 that's exchangeability, then this implies the existence of a mixing measure. And so we have a very natural, what we call hierarchical, model. 13:47:46 Now for our particular type of data, for our count data: 13:47:51 if you took one sample that you had already amplified, and you did the sequencing, you'd have what we call replicate noise, which is Poisson. 13:48:05 If you have, on top of that, batch effects, amplification error effects and biological variation, this Poisson parameter is going to be variable itself. 13:48:22 So it gives a biological meaning to the hierarchical model, where we have this parameter which changes. 13:48:28 And we model this variation of the lambda as a gamma distribution. One of the reasons for doing that is that the gamma distribution is what we call the conjugate prior which goes with the Poisson. 13:48:42 And I'm not going to go too much into the details about that; it is sort of a statistical standard way of doing it. But you don't have to believe me, this is just a little theoretical, you know, pointer, because in fact we did goodness of fit on the data, 13:48:59 and the goodness of fit tells us that the read data fit the gamma-Poisson pretty well. 13:49:10 Now, 13:49:12 I'm presuming that parameterizing this noise model for the data is used for making inferences, point estimates or whatever, later on? 13:49:23 My question is, why not use a resampling scheme rather than parameterize the data, when this parameterization will probably be wrong in some cases, right? Yeah, but resampling is much trickier to do in longitudinal data, for instance. We have 13:49:41 done a lot of, you know, we use the block bootstrap, and we've done a lot of that, but actually at heart I'm a Bayesian. 13:49:50 And as a Bayesian, I hope I will prove to you that it's actually more useful to have a generative model than it is to do resampling, because if you have a generative model you can do all kinds of experiments in silico.
13:50:09 And so, a generative model that fits is richer and more useful. If it's the right model, I agree; you know, you have to put in all kinds of sources of variability, and maybe some that you don't know about. 13:50:25 But it does give you many more tools. And I wrote my thesis on bootstrap resampling, so, you know, it's not that I don't know how to use it; it's that when you have very high dependency in the data, like you do here, 13:50:39 surprisingly enough, you don't have enough data, you don't have enough samples. 13:50:47 Yeah. Okay. So, a follow up question then. 13:50:54 Then why wouldn't I use just a really expressive, you know, generative model? There must be some tension, because I could just use some very fancy machine learning model, but you clearly don't advocate that either. So I was 13:51:08 wondering what you see as the tension here. 13:51:12 So, 13:51:14 first of all, you know, for me, when you say AI or machine learning, I hear black boxes, and I don't work with black boxes. That is, I have to know... 13:51:26 as I said earlier on, I have a meaning, a biological meaning, for all the parameters that I put in, so it's not a black box where I'm twiddling knobs without knowing what they do. 13:51:41 So it's definitely the case that I want a generative model which is very explicit, and where I know what the parameters do and what they mean. 13:51:53 And then, 13:51:54 the number of parameters that you can put in is limited by the amount of data that you have. In many cases, if you have what we call an infinite dimensional model, that just means that you have more parameters than you have 13:52:15 samples; infinity starts at n, you know, nonparametric methods start after you have more parameters than n samples. Then you have to put in extra smoothness constraints or extra conditions, 13:52:31 because otherwise you just make something that fits your data perfectly and doesn't generalize. 13:52:37 So there is a tension there. 13:52:40 But what we try to do is to make models which can be easily interpreted. My motivation is mostly biological parameters, so that we can understand things. 13:52:53 And what comes out, and later on we'll see it with the topics, what comes out is biologically meaningful. 13:52:59 It's not only statistical: there's not only a good goodness of fit, there are also parameters in there that have meaning. 13:53:09 But there is a tension, and definitely creating a generative model which is good enough but not too good is where the tension is. And so, just to show you what a hierarchical model is, it's the idea that here, 13:53:25 if I generate... here I have a gamma distribution, and I'm going to generate my Poisson parameter: say I took a pick from this distribution and I got this value. 13:53:39 Then I take a Poisson with this as its parameter, and I generate one count from the Poisson, and I get this, and I put it in my histogram, and I start again, many, many times. 13:53:50 So, in a hierarchical model you have a two step process: you generate the lambda 13:53:57 from a gamma distribution, and then from that you have a distribution with that lambda, and you generate a random variable which is Poisson. So through this two step process you get a distribution.
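That two-step generation, written out as a minimal sketch with hypothetical gamma parameters; it also checks empirically the negative binomial equivalence that comes up next.

```python
# Hierarchical gamma-Poisson generation: lambda ~ Gamma, then X | lambda ~ Poisson.
# Marginally this is a negative binomial, verified here by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shape, scale = 2.0, 15.0                      # hypothetical Gamma parameters

lam = rng.gamma(shape, scale, size=100_000)   # step 1: draw the Poisson rate
x = rng.poisson(lam)                          # step 2: draw the count

# negative binomial with n=shape, p=1/(1+scale) has the same marginal law
nb = stats.nbinom(shape, 1.0 / (1.0 + scale))
print(x.mean(), nb.mean())                    # both ~ shape*scale = 30
print(x.var(),  nb.var())                     # both ~ 30 + 2*15^2 = 480
```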
13:54:12 And it turns out, and I'm not going to come back to it, but it has another name: if you do a gamma-Poisson hierarchical model you get a negative binomial probability distribution. 13:54:25 So it's a different parameterization, but it's the same thing. 13:54:28 And this is used, it's absolutely standard, also in RNA-seq and many other types of data. It turns out that in microbiome data we actually have a lot of zeros. 13:54:42 And when we started doing this we thought we had to add zero inflation, we had to do zero inflated negative binomial, zero inflated gamma-Poisson, but the gamma distribution provides for the overdispersion and mostly covers the case where you have 13:54:57 a lot of zeros. 13:54:59 This work I first did with Joey McMurdie: it was to fit these negative binomials, and you have to take into account a scaling factor, which is a random variable associated with the library sizes. 13:55:17 And then you have the abundance of the data, 13:55:20 the real prevalence in some sense. Now, having a negative binomial or gamma-Poisson, a statistician immediately knows: oh, I know how to make the transformation which makes this homoscedastic. If you take the hyperbolic arc 13:55:40 sine, the arcsinh, here, you get something which has the same variance across all the domains, so it tells you what the normalization should be. So that's the first advantage. 13:55:54 Here's the goodness of fit. This is real data, which I'm going to talk to you about in a minute. 13:56:01 This was data which had to do with rhizospheres, and in a recent paper that I wrote with Pratheepa Jeganathan, we looked at the data, all the different ASV reads, and we tested whether or not the fit with the negative binomial, or gamma 13:56:25 Poisson, was any good. These are all the p-values, and you can see there are very, very few small p-values; there are even fewer p-values close to zero than what you'd expect if you had a uniform distribution here. So it shows the data fit very 13:56:44 well to this gamma-Poisson. It's a good model for the data. 13:56:51 Now, as I said, generative models are particularly useful for generating in silico data, and especially for evaluating testing procedures, comparing different statistics, or doing power studies. 13:57:05 Now I'm going to talk a little bit about power, because that's what matters with the strain switching. 13:57:11 The power of a statistical procedure, 13:57:15 its technical definition, is the probability of detecting a difference between the groups when there actually is one. 13:57:24 And usually when you write up, you know, a grant for the NIH, you have to do a power study in which you show that you have enough samples in order to have a power of about 80%. 13:57:36 And usually you do that by simulation: you have to do a Monte Carlo experiment to try and see, under different hypotheses of effect sizes, what the power is. 13:57:48 And that's what I am going to show you for the strain switching. Here's a quick question. Yes. I think I might have missed something back on the previous slide. Yeah, exactly, so we're saying that the gamma-Poisson model fits the data well, but, 13:58:03 sorry, maybe I missed what the experiment is. Is it the same sample sequenced twice, or two different samples? What do we actually use to get a good dispersion estimate? 13:58:12 So, I'm talking about... 13:58:16 you take all the reads in your contingency table.
13:58:20 So I showed you the ASV table, with the ASVs and the samples. 13:58:28 And you have the... 13:58:33 So here, for instance, the sample, or specimen (it's unfortunate, the word sample), the biological specimen, is indexed by j, 13:58:47 and i might be the index of the ASV. 13:58:52 And then you have, you know, different replicates, and you have an a_ij, somehow the expected or theoretical abundance, or prevalence in some sense, without the scaling factor, which has to do with the depth of the 13:59:15 sample. 13:59:17 And that product, the scaling factor times the abundance, is the mean of the negative binomial, and for the variance you have a dispersion parameter, that is, how far you are off from the Poisson distribution. 13:59:32 So you have several parameters, and you have parameters which are different. So here, basically, for every cell in the table I'm going to have different parameters. 13:59:45 So, couldn't we get a perfect fit by fitting all the a_ij's to exactly match the contingency table? 13:59:53 Well, no, because I also have repeats, so I have the replicates. Okay. I see, I see, okay, you do have fewer parameters than you have data points. Right, right, right, right. 14:00:12 So then I'm going to show you the example again, and come back to what I was talking about, strain switching. 14:00:23 So we want to test for differential abundance in different taxa. 14:00:33 And this can be testing in different environments or treatment groups. And when you do that kind of test, 14:00:41 suppose that you have i ASVs, so you have that number of species, and you have the vector which is the treatment and the vector which is the control, and you might be trying to find which is the ASV with the maximum difference 14:01:00 between them. And is that ASV 14:01:05 actually significantly different 14:01:12 between the two states? 14:01:13 Okay, you have to do, of course, a correction for multiple testing, because you're doing this test many, many times. And so power is going to be important: you're going to have to have quite a small p-value. 14:01:28 And what I say is that if the strains, say ASV 1 and ASV 25, are interchangeable between the samples, in some samples ASV 1 appears and ASV 25 doesn't, and in other samples it's the other way around, 14:01:53 then the testing procedures will miss them, even if they're significant together. And earlier on, you said, well, that's not clear; maybe I can show you on a little example 14:02:02 what that does. So this is a real example that we took from a paper where they were doing 16S on plants and the rhizosphere, and they had two groups. 14:02:16 I won't go into too many details, but they were using a universal PNA, a peptide nucleic acid clamp that blocks the host plastid DNA, and then they had a modified PNA; the modified PNA is the other group, and they wanted to find out whether there's a difference between 14:02:40 the two ways of treating the samples. Okay, so that was the motivation. 14:02:46 And so we took those data, and we had, 14:02:51 I think, 86 root endosphere specimens and 25 control specimens, and we might have taken out one or two. And we did decontamination using a hierarchical model.
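Going back to the parameterization just described, here is a hedged sketch, with invented numbers, of the negative binomial with mean equal to a scaling factor times the true abundance, and of the arcsinh-type transformation mentioned earlier that approximately stabilizes the variance.

```python
# Negative binomial with mean mu = s_j * a_i (library-size factor times true
# abundance) and dispersion alpha, so variance = mu + alpha * mu^2.
# An asinh transform then roughly equalizes variances across scales.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                   # overdispersion
s = rng.uniform(0.5, 2.0, size=200)           # scaling factors (library sizes)
for a_i in [5, 50, 500]:                      # true abundances on 3 scales
    mu = s * a_i
    n, p = 1 / alpha, 1 / (1 + alpha * mu)    # numpy-style NB parameters
    x = rng.negative_binomial(n, p)
    y = np.arcsinh(np.sqrt(alpha * x)) / np.sqrt(alpha)
    print(a_i, x.var(), y.var())              # raw variance explodes; transformed stays stable
```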
14:03:16 Then we did PERMANOVA testing on it. What that does is you compute some type of distance between the samples, say Bray-Curtis distances between all the samples. 14:03:31 Then you have the two different groups of samples, and you do a permutation test where you change the assignments of the samples to the different groups; you do that many, many times and you create a permutation distribution, which 14:03:47 gives you the null distribution, under which the two groups are not different. 14:03:54 So that, you know, is a permutation approach, a PERMANOVA approach, and I'm going to show you how we did the power simulation using our model, the gamma-Poisson model. 14:04:07 We generated data which were negative binomial, with the same parameters as those of the data. 14:04:17 And just to show you, I made a small example with just eight and eight in the two groups, instead of the original one, which is much, much larger. And here are the two samples. 14:04:32 What I want to show you, because maybe it's clearer what's going on: in this first data set that we generated, here I have strain switching. 14:04:44 That is, here, when I have a number here, I have a zero there. And here I have a zero, and there I have this one. So it's an either-or situation. 14:04:57 And this is the way the strain switching occurs, that is, they're exclusive. 14:05:03 And you see this here. 14:05:06 So this is with strain switching, and this is just a standard generated data set without the strain switching. And what we did is we took the data with strain switching, 14:05:21 and we looked at the power of the test, this particular PERMANOVA test. 14:05:29 And we found that the power was divided by two. 14:05:33 Not theoretically; we just did it by simulation. 14:05:39 What happens is you have much less power if you have this multiplication of strains when in fact they're synonymous. It destroys, in fact, the differential abundance that was occurring before, 14:05:46 in the case of the strain switching. 14:05:56 So the power simulation, 14:05:59 if you go here, 14:06:03 if I'm brave enough, I'll show it. There we have it. 14:06:08 Here you see exactly, you know, how we did it. We did the strain switching; you can go and look, this is the accompanying material that went with the paper, and I uploaded my slides to my site, they're probably not up yet, but you can go in and have a look. 14:06:27 And we have all the code if you want to rerun it. What I just did was create this matrix, in the case either of having a split or not, 14:06:39 and then we have this PERMANOVA which we run. And when we compute the power, 14:06:47 in the standard case we get 0.5, and in the other case, with the switching, we get 0.25. 14:06:57 So this is just to show you that it's possible, and very useful, to be able to do in silico experiments, and you can also, you know, see the kind of problems that we've come up against with this strain switching. And now I want to 14:07:16 propose a solution. So, okay, we went back to the actual data, and we found that in their data set they had in fact 14:07:27 strain switching. It was a bit more complicated, because one of their strains was split into three different ASVs. 14:07:36 So they had strain switching between three, and so they weren't seeing the difference, because they had this multiplicity of ASVs which corresponded to the same synonyms.
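Here is a self-contained re-creation in the spirit of that simulation, not the paper's actual code (which is in the accompanying material): simulate negative binomial counts for two groups, optionally split the differential ASV into two mutually exclusive "strains", and estimate the power of a PERMANOVA-style permutation test on Bray-Curtis distances. All the parameters are invented.

```python
# Power simulation: does splitting one differential ASV into two exclusive
# "strains" reduce the power of a Bray-Curtis permutation (PERMANOVA-style) test?
import numpy as np

rng = np.random.default_rng(0)

def bray_curtis(X):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.abs(X[i] - X[j]).sum() / (X[i] + X[j]).sum()
    return D

def pseudo_F(D, groups):
    # between/within sum-of-squares ratio from the distance matrix
    # (degrees of freedom omitted; they are constant under permutation)
    n = len(D)
    total = (D ** 2).sum() / (2 * n)
    within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        within += (D[np.ix_(idx, idx)] ** 2).sum() / (2 * len(idx))
    return (total - within) / within

def one_dataset(switch, n_per=8, p=20):
    base = rng.uniform(5, 50, size=p)
    mu = np.vstack([base] * (2 * n_per))
    mu[n_per:, 0] *= 4                        # ASV 0 is differentially abundant
    X = rng.negative_binomial(10, 10 / (10 + mu)).astype(float)
    if switch:                                # split ASV 0 into 2 exclusive strains
        z = rng.random(2 * n_per) < 0.5
        X = np.column_stack([X[:, 0] * z, X[:, 0] * ~z, X[:, 1:]])
    return X

def permanova_p(X, groups, n_perm=200):
    D = bray_curtis(X)
    f0 = pseudo_F(D, groups)
    perms = [pseudo_F(D, rng.permutation(groups)) for _ in range(n_perm)]
    return np.mean([f >= f0 for f in perms])

groups = np.repeat([0, 1], 8)
for switch in (False, True):
    rejections = [permanova_p(one_dataset(switch), groups) < 0.05 for _ in range(50)]
    print("switching" if switch else "no switching", "power:", np.mean(rejections))
```

The mechanism is visible in the code: with switching, two samples in the same group can carry different strains, which inflates within-group Bray-Curtis distances and drags the pseudo-F statistic down.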
14:07:50 So, now I want to say something... Can I ask a question? Of course, go ahead. 14:07:56 So, would the strain switching problem go away if we worked at a higher level, like the OTU level, 14:08:07 instead of the ASV level? 14:08:11 Not necessarily. 14:08:16 It doesn't necessarily go away, because I talked to you about strains which are similar, but there are also ones which are synonyms in the same way that words are synonyms: they replace each other, but they're not spelled so closely. 14:08:38 And I would say that it's part of a bigger problem, which is that, in fact, in many communities it's not the presence or absence of one ASV which is important; 14:08:51 it's a guild of ASVs which are doing something, a small community, a small group. 14:08:59 And so I'm going to talk about that. 14:09:01 Alright, so I'm going to 14:09:07 come back to this problem, the fact that the balls-in-boxes and the multinomial don't work. 14:09:15 We said that some of the taxa are mutually exclusive: 14:09:20 the Lactobacillus crispatus and the Gardnerella, for instance. And we see a lot of examples, many of you have given talks where you talk about co-occurrence and syntrophy: 14:09:31 you have species which need each other. 14:09:34 So a simple multinomial model with independence between the balls and boxes just doesn't work. 14:09:40 Because of that, 14:09:44 and because, as I say, some of the strains of bacteria are interchangeable and play the same roles, even if they're not spelled similarly, we want to have a much richer model, and that's where we're going to use these latent Dirichlet models, which are 14:10:00 what we call, you could also call, mixed membership models. So we talked about clustering in the vaginal microbiome, and there I had just made clusters, independent of the response variable, using distances. 14:10:15 In clustering, every sample was assigned to just one community state type. But suppose that actually within your sample you have a mixture of 52% of one type and 48% of the other: 14:10:36 it's about to make a transition from one state type to the other, but you don't see it, because you've just used the majority rule. This mixed membership model allows you to have, within a single sample, 14:10:50 several of what we will call topics, which are more flexible than just a cluster category. 14:10:59 And so in a topic mixture model, every sample can be composed of several topics or guilds. 14:11:06 And there's a very useful parallel, which is what is done in natural language processing. So think of web pages or books: 14:11:16 a book can be made up of several topics. It might be, you know, a book about contracts in sports, so there'll be a lot of words which have to do with legal terms, and a lot of words which have to do with games and sports. 14:11:31 So, you pick a topic at random among a set number of topics when you're going to make the composition of your sample. And each topic... so, if I take the legal topic, then you have a probability distribution where you have a lot of words like contract and 14:11:51 signature and so on. Each topic corresponds to a different probability distribution over all the words, and you pick a word at random according to the chosen topic. 14:12:06 And so this actually solves a lot of the interpretability problems that we were having here. 14:12:13 And so we applied the latent topic model; this is what I did with Kris Sankaran.
14:12:20 We applied it to our microbial communities, and in particular to the antibiotic example I'm just going to show you. The parallel with natural language processing and topics in natural language is that a topic is equivalent to a 14:12:39 small guild or community. 14:12:42 The words correspond to the reads, and then you assign words to a stem, a mother term; that's what you do when you go from the reads to the different species or taxa, using DADA2 for instance. 14:12:58 So a sample is made up of reads in the same way a document is made up of words, and a document can be a pamphlet of four or five pages, or it could be a book of 350 pages, 14:13:10 or it could be a website with only 1,000 words. It's the same for the samples: 14:13:17 it doesn't matter that they don't have the same depth, the same number of reads; these methods don't care about the normalization, so that's not a problem. 14:13:29 And that's actually something which is very useful about these methods. So here, you know, in a topic or community analysis, for us, we'd have subjects and time points, say, and then we have the reads; 14:13:42 and if you had a book, you'd have the chapters, and then here you have, you know, the names of the words which occur, and their frequencies. 14:13:54 Now, in terms of the statistical model, 14:13:57 it's a hierarchical model, and we have latent Dirichlet allocation. It's an alternative to doing a multinomial mixture, and it allows you to create dependencies between the categories. 14:14:13 And you can have samples which are mixed membership. It's been used a lot in admixture studies: Jonathan Pritchard and his co-authors, Matthew Stephens among them, developed it for genetics around 2000, and David Blei rediscovered it, in some sense, for topic 14:14:31 analysis. 14:14:35 We do all the inference using Gibbs sampling, you know, Bayesian methods. And the observed microbiomes become mixtures of underlying community types; so this is the hierarchical model. 14:14:47 And so we have two parameters that we have to generate; they are dual to each other in some sense. We have observations, we have different community types, 14:14:59 and they're like the topics, but one sample can have a little bit of the blue one and a little bit of the red one. Okay, so the samples can belong to different types. 14:15:14 And then we generate everything according to the Dirichlet, so we have a hierarchical model. I won't go into too much detail, I know it's very statistical, so probably, you know, you have to suspend disbelief, and we can talk about it and you can ask me questions. 14:15:33 But in the end, it is a model which is rich enough that it allows us to develop topics, or communities, and then we can go about interpreting the data. And so here, in the antibiotic data, 14:15:49 this was extremely useful, in the sense that we had a first topic which comes up. 14:15:59 These are the samples, with the coefficients that you see along this topic. 14:16:06 These samples are before the antibiotics; this is during the antibiotic. 14:16:12 And we see that three days after, the coefficients jump down. 14:16:18 Then after a little bit more than a week they go back up; in the interim period they're normal, and they go back down again 14:16:26 during the second time course. But you see this resilience build up, that is, they go back to normal. So this topic
one is the most important topic in the antibiotic data, and you can look at the coefficients. 14:16:43 I'm not going to, you know, evaluate all the topics, but it was very meaningful to us. You can see the coefficients for the different species in topic one, 14:16:55 and you can see that, for instance, the Ruminococcus are very prevalent, very important in topic one. They had this behavior of dropping down a lot the first time and a little bit the second time, all in common, but, for instance, they're practically 14:17:15 absent from topic three. 14:17:19 And if you look at what they were doing: topic one is where they go down, and then they come back up, and then they go back down, but only a little bit. 14:17:30 So we have this commonality of behavior of these different families, and it's not necessarily phylogenetic: we've placed these here as they would occur on the phylogenetic tree, and we have species all across the tree which are behaving 14:17:53 according to topic one. So it's not something which is linked to the evolution of the species; 14:18:02 it's a behavior shared by a whole set of different families. 14:18:08 They have a common behavior. So these topics were much easier to interpret and much more useful to us than the original multivariate analysis. That is a real advantage. 14:18:24 Next question: Susan, should I think of this topic model analysis as the analog, if you will, of PCA for things that live on a simplex, like you're trying to choose the four top vectors that can explain 14:18:41 most of the data? Is that how I should be thinking about it? 14:18:50 I think of it more like mixed cluster analysis, more like clusters of curves, 14:18:58 because I think the... 14:19:00 with the samples behaving, you know, according to these clusters. But what you see is that within each sample, it can have some of topic one and some of topic three in it. 14:19:18 Should I be seeing this as: you're also inferring the topics at the same time you're doing all this? Yes. Okay. Yes. 14:19:27 And sorry, one last question, I guess, on the subject of biological interpretability. If I understand this correctly, the same species can be present in multiple topics, 14:19:38 right? 14:19:38 That is true, yes. And so is it making some kind of assumption that it plays an independent role in different topics? Like, I almost imagine a scenario where the same species, if it's in two different topics, is actually doing the same thing 14:19:53 in both of them. You know, think of the word "play"; I always find the parallel with words to be useful. And so if you think of the word "play" in a topic which has to do with culture, it's going to mean something different than in a topic where it has to 14:20:16 do with sports. 14:20:21 So you might have... 14:20:24 So the idea behind the topic analysis is that the difficulty in understanding the microbiome is often that you have a meaning. If you take a language, and you look at a sentence, that sentence has a meaning, but you could have a whole set of other words and 14:20:44 still have the same meaning. 14:20:49 And so the idea is... that's the challenge, right: you don't have only one way of saying things. 14:20:57 And I think that in biology, these communities have several ways of making, you know, certain metabolites, and they are behaving together. And there are alternatives.
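A minimal sketch of fitting such a mixed-membership model with off-the-shelf LDA on a fake count table; note the talk's own analyses use Bayesian inference via Gibbs sampling, whereas scikit-learn's implementation is variational, so treat this as an approximation for exploration.

```python
# Fit LDA to a samples x ASVs count table: theta gives each sample's mixed
# topic memberships, beta gives each topic's distribution over ASVs.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(5, size=(60, 200))            # fake samples x ASVs count table

lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(X)                  # samples x topics proportions
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(theta[0])                               # one sample's topic mixture
print(beta[0].argsort()[::-1][:10])           # top 10 ASVs in topic 1
```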
14:21:56 Can I ask another question? So we played a lot with these LDA models. We have a set of synthetic models that generate synthetic communities, and we found they worked less well than we would hope on synthetic data; partly they were very sensitive 14:22:11 to the number of latent dimensions you had and the amount of data you had, and we could see this because we had ground truth, since we were generating the synthetic communities using some microbial ecological models. 14:22:24 So I was wondering how you dealt with that, because, well, I think I did those simulations myself, so I don't know how much I believe them, 14:22:32 but I found it very difficult to make it work in any sensible way on synthetic data. 14:22:42 The places where we found it works really well were situations like the gut, but also the vaginal microbiome, where you have real communities with a lot of alternative ways for the dynamics to change. 14:23:01 And so there isn't only one pathway. 14:23:05 And so it's able to group together things which don't 14:23:11 give, you know, a one-to-one correspondence, in some sense. And I would imagine that if you only had a few, 14:23:19 if you were doing a mock community or a very small number of species, it wouldn't work at all. 14:23:26 So in this case, you know, it's the equivalent of words: we have thousands and thousands of words, and you have a distribution over them. And so, in some sense, it has to be infinite dimensional. 14:23:39 We have to have many, many more words than samples 14:23:43 for it to be worthwhile and to have any robustness. So we found it doesn't work across the board, but we found it useful in interpreting very rich, messy data, and much more interpretable than doing PCA, for instance. 14:24:01 But did you try things like VAEs, you know, things like variational autoencoders? Does that not help? I mean, I have students who work on that, but we can't really interpret them; we have a really hard time interpreting, because 14:24:19 it's very sensitive to the tuning parameters. 14:24:22 So we haven't found the VAEs, the complex ones, you know, with many layers, to be so useful. I like to keep close to having an uncertainty quantification at the end as well, 14:24:34 and that I haven't been able to get out of them, I mean, unless you do a lot of simulation around a well-known model, but that's not what I want to do in the end.
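[Editor's note: on Ben's point about sensitivity to the number of latent dimensions: one rough way to probe it, sketched below on the same hypothetical objects as above, is to refit over a grid of k and compare log-likelihoods. Held-out evaluation on left-out samples would be the more honest criterion; this in-sample comparison is only illustrative.]

```r
ks   <- c(2, 4, 6, 8, 12)
fits <- lapply(ks, function(k)
  LDA(counts, k = k, method = "Gibbs", control = list(seed = 1)))
data.frame(k = ks,
           logLik = sapply(fits, function(f) as.numeric(logLik(f))))
```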
14:24:48 Can I ask a follow-up question to Ben's? 14:24:51 So, am I right in thinking that the reads from a particular ASV could all be in topic one in one sample, could all be in topic two in another sample, and could be either split between topics one and two in a third sample, or be present in both of those topics 14:25:12 equally? 14:25:17 That wouldn't occur, no; usually the topics come out as sorts of communities or guilds which are co-occurring a lot. 14:25:29 So does the same ASV have to always be in the same topic, or can it be in different topics at different time points? 14:25:44 So the equivalent of the ASVs is the words, right? So, as I said before, you could have the same word in different topics, in different contexts, in different samples; that's possible. 14:25:59 Yes. 14:26:00 Right. Just to go back to your example about the word "play": you could have a book which was about drama and about sports. 14:26:11 Right. And now it's a little difficult to decide where "play" goes; it's sort of in both topics at the same time. Okay, okay. I don't know how to solve that, right. Okay. 14:26:23 Thank you. 14:26:30 That's what we call non-identifiable but, yes. Okay, yeah. 14:26:38 So, if I want to use this method, how many time points do I need at least? And is this similar to, like, the time series clustering methods? 14:26:51 What do you mean by the time series clustering methods? 14:26:55 So, in the graph you showed us for each topic, the patterns -- 14:27:12 the next slide -- 14:27:22 yeah, sorry -- 14:27:22 each topic had similar patterns, maybe, or maybe not. So the characterization, in some sense, of each of the species in topic one is how they behave: 14:27:39 you know, they go down at the first antibiotic perturbation, then they come back up, and at the second antibiotic perturbation the drop is much smaller. That's the pattern for all the species in topic one. 14:27:52 Yes. 14:27:54 So in some sense it's a little bit like clustering, except that, as I said, in clustering you take a sample and you decide it can only be one topic, whereas here you have a mixed membership model. 14:28:13 And how many time series points do I need for this to have significant power? 14:28:21 Well, in this case we had 50. But the question you ask is not well posed, because it's not really the number of time points, it's the number of time points relative to the variability in the time series. 14:28:37 If you have a time series which has a lot of variation, you need a lot of time points. If you have a time series that has very little variability, you don't need a lot of time points. So, you know, the number of time points depends on the amount of 14:28:54 variance. 14:28:55 Okay, okay. 14:28:58 We have one more question; one question from the audience, go ahead. Hi, thanks. I was wondering: this is within one community, so what if we wanted to compare between two communities, which have multiple membership, with this method? 14:29:23 Let me go to my example, because that's exactly the problem; you're completely right. That's the next thing that I wanted to talk about. 14:29:33 Because when we have strain switching, it's better to do differential topic analysis instead of differential ASV analysis, that is, to ask whether the topics are different across the different groups. 14:29:56 But the first thing that happens -- and this is the last part, which maybe answers your question, I don't know whether exactly -- is that there's a preprocessing step. 14:30:12 Once you've got the topics within all the communities, you have to do an alignment, or registration, of the topics across the different fits, because you have a topic, and it might be that it's called topic one in one fit and topic four in the other one. 14:30:24 And you have to do an alignment, a matching of the topics first. 14:30:29 So that's a very good question, but that's what you do.
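[Editor's note: a minimal sketch of that registration step, assuming two hypothetical fits, fit_a and fit_b, obtained separately on the two groups with the same k: match topics by the similarity of their ASV distributions, here with the Hungarian algorithm from the clue package. The speaker doesn't specify the matching criterion used; cosine similarity is just one plausible choice.]

```r
library(clue)   # for solve_LSAP (Hungarian algorithm)

align_topics <- function(fit_a, fit_b) {
  beta_a <- posterior(fit_a)$terms   # k x n_asvs topic-ASV distributions
  beta_b <- posterior(fit_b)$terms
  # cosine similarity between every pair of topics across the two fits
  sim <- (beta_a %*% t(beta_b)) /
         (sqrt(rowSums(beta_a^2)) %o% sqrt(rowSums(beta_b^2)))
  solve_LSAP(sim, maximum = TRUE)   # topic i of fit_a matches topic perm[i] of fit_b
}

# usage: perm <- align_topics(fit_a, fit_b)
# then compare posterior(fit_a)$topics with posterior(fit_b)$topics[, perm]
```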
14:30:37 But I'm going to stop further questions, since we're wrapping up pretty soon. Thank you. Okay. 14:30:42 So, after the alignment of the topics, we do a differential topic analysis, and we compute the Bayesian posterior of the topic proportions in each specimen. 14:30:54 And then we do a differential analysis, the same way that you do a differential abundance analysis for the ASVs, but on the topics. And let me just show you: 14:31:02 this is what we got. 14:31:04 And what we see in this case is that with an ordinary test between the two sample groups, the original testing that was done, 14:31:20 they didn't find a difference. And we actually found there are two topics, topic five and topic ten, which did have a difference. And I'm not going to go into the details, for reasons of time. 14:31:33 This one also, topic four, a little bit; but we did find differences which you couldn't find if you were doing the individual ASVs, because of the switching problem. 14:31:41 And then, if you do paired samples, 14:31:45 then you have much more power, and many of the different topics actually seem to be different. And so, in particular, topics eleven, three, and four were definitely differentially abundant in this paired study. So that solved that problem.
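[Editor's note: a schematic stand-in for this step. The method described works with the full Bayesian posterior of the topic proportions; the sketch below only tests point estimates, with hypothetical group labels, to show why pairing buys power.]

```r
theta_hat <- posterior(fit)$topics                    # samples x topics
group     <- rep(c("A", "B"), each = n_samples / 2)   # hypothetical labels

# one test per topic, unpaired
unpaired_p <- apply(theta_hat, 2, function(x)
  wilcox.test(x[group == "A"], x[group == "B"])$p.value)

# paired version, assuming sample i of group A is matched to sample i of group B
paired_p <- apply(theta_hat, 2, function(x)
  wilcox.test(x[group == "A"], x[group == "B"], paired = TRUE)$p.value)

rbind(unpaired = p.adjust(unpaired_p, "BH"),
      paired   = p.adjust(paired_p,  "BH"))
```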
14:32:11 Okay, so I'm not going to take very much more time, since it's almost time to finish, and we started a little bit late, and it's, you know, a lot of detail. 14:32:22 But there is a way of extending the topic analysis to a full Bayesian nonparametric model, which is more explicit, where you have latent variables which are continuous. 14:32:37 And I'll just point you to the paper, which appeared in JASA: a Bayesian nonparametric ordination for the analysis of microbial communities. It is also a Bayesian method where we had to do alignment, and what you get at 14:32:51 the end -- I'll just show you the picture -- is this type of ordination with the uncertainty quantification that you get from a Bayesian posterior. And here we have the samples which had very low depth, 14:33:05 and those are the ones which are much more uncertain than the others. 14:33:09 So you see, in some sense, it has the advantage that you can see the uncertainties. 14:33:15 Okay. So, I'm very much in favor of making what we do available, so it's really important that for all of our papers 14:33:26 you can repeat the analysis and the simulations, and analyze the data yourself and see how robust it is to changes. I put links in the PDF file; you'll see links to the reproducible research supplements of the work that I talked about, and various 14:33:44 kinds of tools. We write all our programs in R, and Bioconductor hosts many of them. And so there are a lot of resources. I would remind you that in Silicon Valley we have a Yoda, and he says that premature optimization is the root of all evil 14:34:04 in coding. And I would say that what I see in the microbiome, when people analyze the data, is that they do too much premature summarization; in particular, taking relative abundances at the get-go is a big mistake, because you're losing a lot of information 14:34:19 about the distribution of depth. 14:34:21 So, what I learned is this: when we started off, we didn't understand the raw data and the noise models, and it took us, you know, a couple of years to work out the gamma-Poisson. 14:34:43 So, document all the sources of variability, so that you can change your tuning parameters -- the number of topics, for instance; sometimes we take more, sometimes we take less. 14:34:53 It's not set in stone; there isn't one way of doing these things. Hierarchical models are the key generators for mixtures, and in particular for generating data in silico. 14:35:05 And I consider that my analysis is only as good as the explanation and the software that goes with it. 14:35:10 And if you have any more questions -- 14:35:13 I do, too. 14:35:15 I want to thank my lab group and the people I work with. Kris Sankaran did a lot of the work on the topic models, and Pratheepa did the work on the alignment and the differential topic analysis. 14:35:28 Okay, so thank you for your patience. 14:35:35 Thank you. 14:35:35 And if any of you are interested in a practical viewpoint on seeing how this was done, the analysis of the example, it turns out that Pratheepa is going to give a workshop, and you can actually still register for it; it costs $10 to go to the 14:35:53 whole Bioconductor week, and it'll be at six, I think, PST on Wednesday: statistical methods for microbiome data analysis. She's going to show it. 14:36:10 So if you're interested. Okay, let's thank the speaker. Thank you. 14:36:21 All right, I think if there are further questions, we'll ask that people take them offline, and we'll have some coffee. Thank you, Susan. Thanks.