Econstudentlog

Imitation Games – Avi Wigderson

If you wish to skip the introduction the talk starts at 5.20. The talk itself lasts roughly an hour, with the last ca. 20 minutes devoted to Q&A – that part is worth watching as well.

Some links related to the talk below:

Theory of computation.
Turing test.
COMPUTING MACHINERY AND INTELLIGENCE.
Probabilistic encryption & how to play mental poker keeping secret all partial information Goldwasser-Micali82.
Probabilistic algorithm
How To Generate Cryptographically Strong Sequences Of Pseudo-Random Bits (Blum&Micali, 1984)
Randomness extractor
Dense graph
Periodic sequence
Extremal graph theory
Szemerédi’s theorem
Green–Tao theorem
Szemerédi regularity lemma
New Proofs of the Green-Tao-Ziegler Dense Model Theorem: An Exposition
Calibrating Noise to Sensitivity in Private Data Analysis
Generalization in Adaptive Data Analysis and Holdout Reuse
Book: Math and Computation | Avi Wigderson
One-way function
Lattice-based cryptography

August 23, 2021 Posted by | Computer science, Cryptography, Data, Lectures, Mathematics, Science, Statistics | Leave a comment

Designing Fast and Robust Learning Algorithms – Yu Cheng

Some links related to the lecture’s coverage:

Recommender system.
Collaborative filtering.
Matrix completion.
Non-Convex Matrix Completion Against a Semi-Random Adversary (Cheng & Ge, 2018).
Singular value decomposition.
Spectral graph theory.
Spectral Sparsification of Graphs (Spielman & Teng).
Cut (graph theory).
Split (graph theory).
Robust statistics.
Being Robust (in High Dimensions) Can Be Practical (Diakonikolas et al).
High-Dimensional Robust Mean Estimation in Nearly-Linear Time (Cheng, Diakonikolas and Ge).

October 13, 2019 Posted by | Computer science, Lectures, Mathematics, Statistics | Leave a comment

Data science (I?)

I’m not sure if I’ll actually blog this book in detail – I might, later on, but for now I’ll just cover it extremely lazily, by adding links to topics covered which I figured I wanted to include in this post.

The book is ‘okay’ – it’ll both allow (relatively) non-technical (management) people to at least begin to understand what sort of tasks the more technical guys are spending time on (and how to prioritize regarding critical resources, and engage with the nerds!), and it might also give the data guys a few more tools that they’ll be able to use when confronted with a specific issue. I really liked the book’s emphasis on conceptualizing data as a strategic asset. On the other hand I imagine some parts of the book will often be close to painful to read for people who have spent at least a few semesters dealing with stats-related topics in the past: This is the sort of book which is also at least in part written for people who might not be completely clear on what a statistical hypothesis test is, which discusses text mining without at any point in the coverage even mentioning the existence of regular expressions, and which discusses causal evaluation without mentioning topics like IV estimation.

Although there are some major gaps in the coverage the level of coverage is however not really all that bad; I hope to refer to at least some of the more technical material included in the book in my work in the future, but it’s not clear at this point how relevant this stuff’ll actually end up being long-term.

Links (…in random order, I did not have the book in front of me as I was writing this post so this is just a collection of links/topics I could recall being potentially worth including here):

Training, validation, and test sets
Cross-validation (statistics)
Statistical classification
Tree model
Decision tree pruning
Random forest
Naive Bayes classifier
Bigram
n-gram
Data mining
Zipf’s law (not covered, but relevant to some parts of the coverage)
Nearest neighbor search
K-nearest_neighbors_algorithm
Cluster analysis
Jaccard index
Bias–variance tradeoff
Hierarchical clustering
Dendrogram
Boosting (machine learning)
Ensemble learning
Feature (machine learning)
Feature selection
Curse of dimensionality
Regularization (mathematics)
Overfitting
Association rule learning
Labeled data
Dimensionality reduction
Supervised_learning/Unsupervised learning
Model selection
Rubin causal model (not covered, but relevant to some parts of the coverage)
Regression discontinuity design (-ll-)
Lift (data mining)
Receiver operating characteristic
Stepwise regression
Grid search (hyperparameter optimization).

October 4, 2019 Posted by | Books, Mathematics, Statistics | Leave a comment

Learning Phylogeny Through Simple Statistical Genetics

From a brief skim I concluded that a lot of the stuff Patterson talks about in this lecture, particularly in terms of the concepts and methods part (…which, as he also alludes to in his introduction, makes up a substantial proportion of the talk), is included/covered in this Ancient Admixture in Human History paper he coauthored, so if you’re either curious to know more, or perhaps just wondering what the talk might be about, it’s probably worth checking it out. In the latter case I would also recommend perhaps just watching the first few minutes of the talk; he provides a very informative outline of the talk in the first four and a half minutes of the video.

A few other links of relevance:

Martingale (probability theory).
GitHub – DReichLab/AdmixTools.
Human Genome Diversity Project.
Jackknife resampling.
Ancient North Eurasian.
Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans (Raghavan et al, 2014).
General theory for stochastic admixture graphs and F-statistics. This one is only very slightly related to the talk; I came across it while looking for stuff about admixture graphs, a topic he does briefly discuss in the lecture.

July 29, 2019 Posted by | Archaeology, Biology, Genetics, Lectures, Molecular biology, Statistics | Leave a comment

Random stuff

i. Your Care Home in 120 Seconds. Some quotes:

“In order to get an overall estimate of mental power, psychologists have chosen a series of tasks to represent some of the basic elements of problem solving. The selection is based on looking at the sorts of problems people have to solve in everyday life, with particular attention to learning at school and then taking up occupations with varying intellectual demands. Those tasks vary somewhat, though they have a core in common.

Most tests include Vocabulary, examples: either asking for the definition of words of increasing rarity; or the names of pictured objects or activities; or the synonyms or antonyms of words.

Most tests include Reasoning, examples: either determining which pattern best completes the missing cell in a matrix (like Raven’s Matrices); or putting in the word which completes a sequence; or finding the odd word out in a series.

Most tests include visualization of shapes, examples: determining the correspondence between a 3-D figure and alternative 2-D figures; determining the pattern of holes that would result from a sequence of folds and a punch through folded paper; determining which combinations of shapes are needed to fill a larger shape.

Most tests include episodic memory, examples: number of idea units recalled across two or three stories; number of words recalled from across 1 to 4 trials of a repeated word list; number of words recalled when presented with a stimulus term in a paired-associate learning task.

Most tests include a rather simple set of basic tasks called Processing Skills. They are rather humdrum activities, like checking for errors, applying simple codes, and checking for similarities or differences in word strings or line patterns. They may seem low grade, but they are necessary when we try to organise ourselves to carry out planned activities. They tend to decline with age, leading to patchy, unreliable performance, and a tendency to muddled and even harmful errors. […]

A brain scan, for all its apparent precision, is not a direct measure of actual performance. Currently, scans are not as accurate in predicting behaviour as is a simple test of behaviour. This is a simple but crucial point: so long as you are willing to conduct actual tests, you can get a good understanding of a person’s capacities even on a very brief examination of their performance. […] There are several tests which have the benefit of being quick to administer and powerful in their predictions.[..] All these tests are good at picking up illness related cognitive changes, as in diabetes. (Intelligence testing is rarely criticized when used in medical settings). Delayed memory and working memory are both affected during diabetic crises. Digit Symbol is reduced during hypoglycaemia, as are Digits Backwards. Digit Symbol is very good at showing general cognitive changes from age 70 to 76. Again, although this is a limited time period in the elderly, the decline in speed is a notable feature. […]

The most robust and consistent predictor of cognitive change within old age, even after control for all the other variables, was the presence of the APOE e4 allele. APOE e4 carriers showed over half a standard deviation more general cognitive decline compared to noncarriers, with particularly pronounced decline in their Speed and numerically smaller, but still significant, declines in their verbal memory.

It is rare to have a big effect from one gene. Few people carry it, and it is not good to have.

ii. What are common mistakes junior data scientists make?

Apparently the OP had second thoughts about this query so s/he deleted the question and marked the thread nsfw (??? …nothing remotely nsfw in that thread…). Fortunately the replies are all still there, there are quite a few good responses in the thread. I added some examples below:

“I think underestimating the domain/business side of things and focusing too much on tools and methodology. As a fairly new data scientist myself, I found myself humbled during this one project where I had I spent a lot of time tweaking parameters and making sure the numbers worked just right. After going into a meeting about it became clear pretty quickly that my little micro-optimizations were hardly important, and instead there were X Y Z big picture considerations I was missing in my analysis.”

[…]

  • Forgetting to check how actionable the model (or features) are. It doesn’t matter if you have amazing model for cancer prediction, if it’s based on features from tests performed as part of the post-mortem. Similarly, predicting account fraud after the money has been transferred is not going to be very useful.

  • Emphasis on lack of understanding of the business/domain.

  • Lack of communication and presentation of the impact. If improving your model (which is a quarter of the overall pipeline) by 10% in reducing customer churn is worth just ~100K a year, then it may not be worth putting into production in a large company.

  • Underestimating how hard it is to productionize models. This includes acting on the models outputs, it’s not just “run model, get score out per sample”.

  • Forgetting about model and feature decay over time, concept drift.

  • Underestimating the amount of time for data cleaning.

  • Thinking that data cleaning errors will be complicated.

  • Thinking that data cleaning will be simple to automate.

  • Thinking that automation is always better than heuristics from domain experts.

  • Focusing on modelling at the expense of [everything] else”

“unhealthy attachments to tools. It really doesn’t matter if you use R, Python, SAS or Excel, did you solve the problem?”

“Starting with actual modelling way too soon: you’ll end up with a model that’s really good at answering the wrong question.
First, make sure that you’re trying to answer the right question, with the right considerations. This is typically not what the client initially told you. It’s (mainly) a data scientist’s job to help the client with formulating the right question.”

iii. Some random wikipedia links: Ottoman–Habsburg wars. Planetshine. Anticipation (genetics). Cloze test. Loop quantum gravity. Implicature. Starfish Prime. Stall (fluid dynamics). White Australia policy. Apostatic selection. Deimatic behaviour. Anti-predator adaptation. Lefschetz fixed-point theorem. Hairy ball theorem. Macedonia naming dispute. Holevo’s theorem. Holmström’s theorem. Sparse matrix. Binary search algorithm. Battle of the Bismarck Sea.

iv. 5-HTTLPR: A Pointed Review. This one is hard to quote, you should read all of it. I did however decide to add a few quotes from the post, as well as a few quotes from the comments:

“…what bothers me isn’t just that people said 5-HTTLPR mattered and it didn’t. It’s that we built whole imaginary edifices, whole castles in the air on top of this idea of 5-HTTLPR mattering. We “figured out” how 5-HTTLPR exerted its effects, what parts of the brain it was active in, what sorts of things it interacted with, how its effects were enhanced or suppressed by the effects of other imaginary depression genes. This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.

This is why I start worrying when people talk about how maybe the replication crisis is overblown because sometimes experiments will go differently in different contexts. The problem isn’t just that sometimes an effect exists in a cold room but not in a hot room. The problem is more like “you can get an entire field with hundreds of studies analyzing the behavior of something that doesn’t exist”. There is no amount of context-sensitivity that can help this. […] The problem is that the studies came out positive when they shouldn’t have. This was a perfectly fine thing to study before we understood genetics well, but the whole point of studying is that, once you have done 450 studies on something, you should end up with more knowledge than you started with. In this case we ended up with less. […] I think we should take a second to remember that yes, this is really bad. That this is a rare case where methodological improvements allowed a conclusive test of a popular hypothesis, and it failed badly. How many other cases like this are there, where there’s no geneticist with a 600,000 person sample size to check if it’s true or not? How many of our scientific edifices are built on air? How many useless products are out there under the guise of good science? We still don’t know.”

A few more quotes from the comment section of the post:

“most things that are obviously advantageous or deleterious in a major way aren’t gonna hover at 10%/50%/70% allele frequency.

Population variance where they claim some gene found in > [non trivial]% of the population does something big… I’ll mostly tend to roll to disbelieve.

But if someone claims a family/village with a load of weirdly depressed people (or almost any other disorder affecting anything related to the human condition in any horrifying way you can imagine) are depressed because of a genetic quirk… believable but still make sure they’ve confirmed it segregates with the condition or they’ve got decent backing.

And a large fraction of people have some kind of rare disorder […]. Long tail. Lots of disorders so quite a lot of people with something odd.

It’s not that single variants can’t have a big effect. It’s that really big effects either win and spread to everyone or lose and end up carried by a tiny minority of families where it hasn’t had time to die out yet.

Very few variants with big effect sizes are going to be half way through that process at any given time.

Exceptions are

1: mutations that confer resistance to some disease as a tradeoff for something else […] 2: Genes that confer a big advantage against something that’s only a very recent issue.”

“I think the summary could be something like:
A single gene determining 50% of the variance in any complex trait is inherently atypical, because variance depends on the population plus environment and the selection for such a gene would be strong, rapidly reducing that variance.
However, if the environment has recently changed or is highly variable, or there is a trade-off against adverse effects it is more likely.
Furthermore – if the test population is specifically engineered to target an observed trait following an apparently Mendelian inheritance pattern – such as a family group or a small genetically isolated population plus controls – 50% of the variance could easily be due to a single gene.”

v. Less research is needed.

“The most over-used and under-analyzed statement in the academic vocabulary is surely “more research is needed”. These four words, occasionally justified when they appear as the last sentence in a Masters dissertation, are as often to be found as the coda for a mega-trial that consumed the lion’s share of a national research budget, or that of a Cochrane review which began with dozens or even hundreds of primary studies and progressively excluded most of them on the grounds that they were “methodologically flawed”. Yet however large the trial or however comprehensive the review, the answer always seems to lie just around the next empirical corner.

With due respect to all those who have used “more research is needed” to sum up months or years of their own work on a topic, this ultimate academic cliché is usually an indicator that serious scholarly thinking on the topic has ceased. It is almost never the only logical conclusion that can be drawn from a set of negative, ambiguous, incomplete or contradictory data.” […]

“Here is a quote from a typical genome-wide association study:

“Genome-wide association (GWA) studies on coronary artery disease (CAD) have been very successful, identifying a total of 32 susceptibility loci so far. Although these loci have provided valuable insights into the etiology of CAD, their cumulative effect explains surprisingly little of the total CAD heritability.”  [1]

The authors conclude that not only is more research needed into the genomic loci putatively linked to coronary artery disease, but that – precisely because the model they developed was so weak – further sets of variables (“genetic, epigenetic, transcriptomic, proteomic, metabolic and intermediate outcome variables”) should be added to it. By adding in more and more sets of variables, the authors suggest, we will progressively and substantially reduce the uncertainty about the multiple and complex gene-environment interactions that lead to coronary artery disease. […] We predict tomorrow’s weather, more or less accurately, by measuring dynamic trends in today’s air temperature, wind speed, humidity, barometric pressure and a host of other meteorological variables. But when we try to predict what the weather will be next month, the accuracy of our prediction falls to little better than random. Perhaps we should spend huge sums of money on a more sophisticated weather-prediction model, incorporating the tides on the seas of Mars and the flutter of butterflies’ wings? Of course we shouldn’t. Not only would such a hyper-inclusive model fail to improve the accuracy of our predictive modeling, there are good statistical and operational reasons why it could well make it less accurate.”

vi. Why software projects take longer than you think – a statistical model.

Anyone who built software for a while knows that estimating how long something is going to take is hard. It’s hard to come up with an unbiased estimate of how long something will take, when fundamentally the work in itself is about solving something. One pet theory I’ve had for a really long time, is that some of this is really just a statistical artifact.

Let’s say you estimate a project to take 1 week. Let’s say there are three equally likely outcomes: either it takes 1/2 week, or 1 week, or 2 weeks. The median outcome is actually the same as the estimate: 1 week, but the mean (aka average, aka expected value) is 7/6 = 1.17 weeks. The estimate is actually calibrated (unbiased) for the median (which is 1), but not for the the mean.

A reasonable model for the “blowup factor” (actual time divided by estimated time) would be something like a log-normal distribution. If the estimate is one week, then let’s model the real outcome as a random variable distributed according to the log-normal distribution around one week. This has the property that the median of the distribution is exactly one week, but the mean is much larger […] Intuitively the reason the mean is so large is that tasks that complete faster than estimated have no way to compensate for the tasks that take much longer than estimated. We’re bounded by 0, but unbounded in the other direction.”

I like this way to conceptually frame the problem, and I definitely do not think it only applies to software development.

“I filed this in my brain under “curious toy models” for a long time, occasionally thinking that it’s a neat illustration of a real world phenomenon I’ve observed. But surfing around on the interwebs one day, I encountered an interesting dataset of project estimation and actual times. Fantastic! […] The median blowup factor turns out to be exactly 1x for this dataset, whereas the mean blowup factor is 1.81x. Again, this confirms the hunch that developers estimate the median well, but the mean ends up being much higher. […]

If my model is right (a big if) then here’s what we can learn:

  • People estimate the median completion time well, but not the mean.
  • The mean turns out to be substantially worse than the median, due to the distribution being skewed (log-normally).
  • When you add up the estimates for n tasks, things get even worse.
  • Tasks with the most uncertainty (rather the biggest size) can often dominate the mean time it takes to complete all tasks.”

vii. Attraction inequality and the dating economy.

“…the relentless focus on inequality among politicians is usually quite narrow: they tend to consider inequality only in monetary terms, and to treat “inequality” as basically synonymous with “income inequality.” There are so many other types of inequality that get air time less often or not at all: inequality of talent, height, number of friends, longevity, inner peace, health, charm, gumption, intelligence, and fortitude. And finally, there is a type of inequality that everyone thinks about occasionally and that young single people obsess over almost constantly: inequality of sexual attractiveness. […] One of the useful tools that economists use to study inequality is the Gini coefficient. This is simply a number between zero and one that is meant to represent the degree of income inequality in any given nation or group. An egalitarian group in which each individual has the same income would have a Gini coefficient of zero, while an unequal group in which one individual had all the income and the rest had none would have a Gini coefficient close to one. […] Some enterprising data nerds have taken on the challenge of estimating Gini coefficients for the dating “economy.” […] The Gini coefficient for [heterosexual] men collectively is determined by [-ll-] women’s collective preferences, and vice versa. If women all find every man equally attractive, the male dating economy will have a Gini coefficient of zero. If men all find the same one woman attractive and consider all other women unattractive, the female dating economy will have a Gini coefficient close to one.”

“A data scientist representing the popular dating app “Hinge” reported on the Gini coefficients he had found in his company’s abundant data, treating “likes” as the equivalent of income. He reported that heterosexual females faced a Gini coefficient of 0.324, while heterosexual males faced a much higher Gini coefficient of 0.542. So neither sex has complete equality: in both cases, there are some “wealthy” people with access to more romantic experiences and some “poor” who have access to few or none. But while the situation for women is something like an economy with some poor, some middle class, and some millionaires, the situation for men is closer to a world with a small number of super-billionaires surrounded by huge masses who possess almost nothing. According to the Hinge analyst:

On a list of 149 countries’ Gini indices provided by the CIA World Factbook, this would place the female dating economy as 75th most unequal (average—think Western Europe) and the male dating economy as the 8th most unequal (kleptocracy, apartheid, perpetual civil war—think South Africa).”

Btw., I’m reasonably certain “Western Europe” as most people think of it is not average in terms of Gini, and that half-way down the list should rather be represented by some other region or country type, like, say Mongolia or Bulgaria. A brief look at Gini lists seemed to support this impression.

Quartz reported on this finding, and also cited another article about an experiment with Tinder that claimed that that “the bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women and the top 78% of women are competing for the top 20% of men.” These studies examined “likes” and “swipes” on Hinge and Tinder, respectively, which are required if there is to be any contact (via messages) between prospective matches. […] Yet another study, run by OkCupid on their huge datasets, found that women rate 80 percent of men as “worse-looking than medium,” and that this 80 percent “below-average” block received replies to messages only about 30 percent of the time or less. By contrast, men rate women as worse-looking than medium only about 50 percent of the time, and this 50 percent below-average block received message replies closer to 40 percent of the time or higher.

If these findings are to be believed, the great majority of women are only willing to communicate romantically with a small minority of men while most men are willing to communicate romantically with most women. […] It seems hard to avoid a basic conclusion: that the majority of women find the majority of men unattractive and not worth engaging with romantically, while the reverse is not true. Stated in another way, it seems that men collectively create a “dating economy” for women with relatively low inequality, while women collectively create a “dating economy” for men with very high inequality.”

I think the author goes a bit off the rails later in the post, but the data is interesting. It’s however important keeping in mind in contexts like these that sexual selection pressures apply at multiple levels, not just one, and that partner preferences can be non-trivial to model satisfactorily; for example as many women have learned the hard way, males may have very different standards for whom to a) ‘engage with romantically’ and b) ‘consider a long-term partner’.

viii. Flipping the Metabolic Switch: Understanding and Applying Health Benefits of Fasting.

“Intermittent fasting (IF) is a term used to describe a variety of eating patterns in which no or few calories are consumed for time periods that can range from 12 hours to several days, on a recurring basis. Here we focus on the physiological responses of major organ systems, including the musculoskeletal system, to the onset of the metabolic switch – the point of negative energy balance at which liver glycogen stores are depleted and fatty acids are mobilized (typically beyond 12 hours after cessation of food intake). Emerging findings suggest the metabolic switch from glucose to fatty acid-derived ketones represents an evolutionarily conserved trigger point that shifts metabolism from lipid/cholesterol synthesis and fat storage to mobilization of fat through fatty acid oxidation and fatty-acid derived ketones, which serve to preserve muscle mass and function. Thus, IF regimens that induce the metabolic switch have the potential to improve body composition in overweight individuals. […] many experts have suggested IF regimens may have potential in the treatment of obesity and related metabolic conditions, including metabolic syndrome and type 2 diabetes.()”

“In most studies, IF regimens have been shown to reduce overall fat mass and visceral fat both of which have been linked to increased diabetes risk.() IF regimens ranging in duration from 8 to 24 weeks have consistently been found to decrease insulin resistance.(, , , , , , , , , ) In line with this, many, but not all,() large-scale observational studies have also shown a reduced risk of diabetes in participants following an IF eating pattern.”

“…we suggest that future randomized controlled IF trials should use biomarkers of the metabolic switch (e.g., plasma ketone levels) as a measure of compliance and the magnitude of negative energy balance during the fasting period. It is critical for this switch to occur in order to shift metabolism from lipidogenesis (fat storage) to fat mobilization for energy through fatty acid β-oxidation. […] As the health benefits and therapeutic efficacies of IF in different disease conditions emerge from RCTs, it is important to understand the current barriers to widespread use of IF by the medical and nutrition community and to develop strategies for broad implementation. One argument against IF is that, despite the plethora of animal data, some human studies have failed to show such significant benefits of IF over CR [Calorie Restriction].() Adherence to fasting interventions has been variable, some short-term studies have reported over 90% adherence,() whereas in a one year ADMF study the dropout rate was 38% vs 29% in the standard caloric restriction group.()”

ix. Self-repairing cells: How single cells heal membrane ruptures and restore lost structures.

June 2, 2019 Posted by | Astronomy, Biology, Data, Diabetes, Economics, Evolutionary biology, Genetics, Geography, History, Mathematics, Medicine, Physics, Psychology, Statistics, Wikipedia | Leave a comment

Reproducible, Reusable, and Robust Reinforcement Learning

This pdf was created some time before the lecture took place, but it seems to contains all the slides included in the lecture – so if you want a short version of the talk I guess you can read that. I’ve added a few other lecture-relevant links below.

REPRODUCIBILITY, REPLICABILITY, AND GENERALIZATION IN THE SOCIAL, BEHAVIORAL, AND ECONOMIC SCIENCES (Bollen et al. 2015).
1,500 scientists lift the lid on reproducibility (Nature).
Reinforcement learning.
AlphaGo. Libratus.
Adaptive control of epileptiform excitability in an in vitro model of limbic seizures (Panuccio, Guez, Vincent, Avoli and Pineau, 2013)
Deep Reinforcement Learning that Matters (Henderson et al, 2019).
Policy gradient methods.
Hyperparameter (machine learning).
Transfer learning.

May 1, 2019 Posted by | Computer science, Data, Lectures, Statistics | Leave a comment

Artificial intelligence (I?)

This book was okay, but nothing all that special. In my opinion there’s too much philosophy and similar stuff in there (‘what does intelligence really mean anyway?’), and the coverage isn’t nearly as focused on technological aspects as e.g. Winfield’s (…in my opinion better…) book from the same series on robotics (which I covered here) was; I am certain I’d have liked this book better if it’d provided a similar type of coverage as did Winfield, but it didn’t. However it’s far from terrible and I liked the authors skeptical approach to e.g. singularitarianism. Below I have added some quotes and links, as usual.

“Artificial intelligence (AI) seeks to make computers do the sorts of things that minds can do. Some of these (e.g. reasoning) are normally described as ‘intelligent’. Others (e.g. vision) aren’t. But all involve psychological skills — such as perception, association, prediction, planning, motor control — that enable humans and animals to attain their goals. Intelligence isn’t a single dimension, but a richly structured space of diverse information-processing capacities. Accordingly, AI uses many different techniques, addressing many different tasks. […] although AI needs physical machines (i.e. computers), it’s best thought of as using what computer scientists call virtual machines. A virtual machine isn’t a machine depicted in virtual reality, nor something like a simulated car engine used to train mechanics. Rather, it’s the information-processing system that the programmer has in mind when writing a program, and that people have in mind when using it. […] Virtual machines in general are comprised of patterns of activity (information processing) that exist at various levels. […] the human mind can be understood as the virtual machine – or rather, the set of mutually interacting virtual machines, running in parallel […] – that is implemented in the brain. Progress in AI requires progress in defining interesting/useful virtual machines. […] How the information is processed depends on the virtual machine involved. [There are many different approaches.] […] In brief, all the main types of AI were being thought about, and even implemented, by the late 1960s – and in some cases, much earlier than that. […] Neural networks are helpful for modelling aspects of the brain, and for doing pattern recognition and learning. Classical AI (especially when combined with statistics) can model learning too, and also planning and reasoning. Evolutionary programming throws light on biological evolution and brain development. Cellular automata and dynamical systems can be used to model development in living organisms. Some methodologies are closer to biology than to psychology, and some are closer to non-reflective behaviour than to deliberative thought. To understand the full range of mentality, all of them will be needed […]. Many AI researchers [however] don’t care about how minds work: they seek technological efficiency, not scientific understanding. […] In the 21st century, […] it has become clear that different questions require different types of answers”.

“State-of-the-art AI is a many-splendoured thing. It offers a profusion of virtual machines, doing many different kinds of information processing. There’s no key secret here, no core technique unifying the field: AI practitioners work in highly diverse areas, sharing little in terms of goals and methods. […] A host of AI applications exist, designed for countless specific tasks and used in almost every area of life, by laymen and professionals alike. Many outperform even the most expert humans. In that sense, progress has been spectacular. But the AI pioneers weren’t aiming only for specialist systems. They were also hoping for systems with general intelligence. Each human-like capacity they modelled — vision, reasoning, language, learning, and so on — would cover its entire range of challenges. Moreover, these capacities would be integrated when appropriate. Judged by those criteria, progress has been far less impressive. […] General intelligence is still a major challenge, still highly elusive. […] problems can’t always be solved merely by increasing computer power. New problem-solving methods are often needed. Moreover, even if a particular method must succeed in principle, it may need too much time and/or memory to succeed in practice. […] Efficiency is important, too: the fewer the number of computations, the better. In short, problems must be made tractable. There are several basic strategies for doing that. All were pioneered by classical symbolic AI, or GOFAI, and all are still essential today. One is to direct attention to only a part of the search space (the computer’s representation of the problem, within which the solution is assumed to be located). Another is to construct a smaller search space by making simplifying assumptions. A third is to order the search efficiently. Yet another is to construct a different search space, by representing the problem in a new way. These approaches involve heuristics, planning, mathematical simplification, and knowledge representation, respectively. […] Often, the hardest part of AI problem solving is presenting the problem to the system in the first place. […] the information (‘knowledge’) concerned must be presented to the system in a fashion that the machine can understand – in other words, that it can deal with. […] AI’s way of doing this are highly diverse.”

“The rule-baed form of knowledge representation enables programs to be built gradually, as the programmer – or perhaps an AGI system itself – learns more about the domain. A new rule can be added at any time. There’s no need to rewrite the program from scratch. However, there’s a catch. If the new rule isn’t logically consistent with the existing ones, the system won’t always do what it’s supposed to do. It may not even approximate what it’s supposed to do. When dealing with a small set of rules, such logical conflicts are easily avoided, but larger systems are less transparent. […] An alternative form of knowledge representation for concepts is semantic networks […] A semantic network links concepts by semantic relations […] semantic networks aren’t the same thing as neural networks. […] distributed neural networks represent knowledge in a very different way. There, individual concepts are represented not by a single node in a carefully defined associative net, but by the changing patterns of activity across an entire network. Such systems can tolerate conflicting evidence, so aren’t bedevilled by the problems of maintaining logical consistency […] Even a single mind involves distributed cognition, for it integrates many cognitive, motivational, and emotional subsystems […] Clearly, human-level AGI would involve distributed cognition.”

“In short, most human visual achievements surpass today’s AI. Often, AI researchers aren’t clear about what questions to ask. For instance, think about folding a slippery satin dress neatly. No robot can do this (although some can be instructed, step by step, how to fold an oblong terry towel). Or consider putting on a T-shirt: the head must go in first, and not via a sleeve — but why? Such topological problems hardly feature in AI. None of this implies that human-level computer vision is impossible. But achieving it is much more difficult than most people believe. So this is a special case of the fact noted in Chapter 1: that AI has taught us that human minds are hugely richer, and more subtle, than psychologists previously imagined. Indeed, that is the main lesson to be learned from AI. […] Difficult though it is to build a high-performing AI specialist, building an AI generalist is orders of magnitude harder. (Deep learning isn’t the answer: its aficionados admit that ‘new paradigms are needed’ to combine it with complex reasoning — scholarly code for ‘we haven’t got a clue’.) That’s why most AI researchers abandoned that early hope, turning instead to multifarious narrowly defined tasks—often with spectacular success.”

“Some machine learning uses neural networks. But much relies on symbolic AI, supplemented by powerful statistical algorithms. In fact, the statistics really do the work, the GOFAI merely guiding the worker to the workplace. Accordingly, some professionals regard machine learning as computer science and/or statistics —not AI. However, there’s no clear boundary here. Machine learning has three broad types: supervised, unsupervised, and reinforcement learning. […] In supervised learning, the programmer ‘trains’ the system by defining a set of desired outcomes for a range of inputs […], and providing continual feedback about whether it has achieved them. The learning system generates hypotheses about the relevant features. Whenever it classifies incorrectly, it amends its hypothesis accordingly. […] In unsupervised learning, the user provides no desired outcomes or error messages. Learning is driven by the principle that co-occurring features engender expectations that they will co-occur in future. Unsupervised learning can be used to discover knowledge. The programmers needn’t know what patterns/clusters exist in the data: the system finds them for itself […but even though Boden does not mention this fact, caution is most definitely warranted when applying such systems/methods to data (..it remains true that “Truth and true models are not statistically identifiable from data” – as usual, the go-to reference here is Burnham & Anderson)]. Finally, reinforcement learning is driven by analogues of reward and punishment: feedback messages telling the system that what it just did was good or bad. Often, reinforcement isn’t simply binary […] Given various theories of probability, there are many different algorithms suitable for distinct types of learning and different data sets.”

“Countless AI applications use natural language processing (NLP). Most focus on the computer’s ‘understanding’ of language that is presented to it, not on its own linguistic production. That’s because NLP generation is even more difficult than NLP acceptance [I had a suspicion this might be the case before reading the book, but I didn’t know – US]. […] It’s now clear that handling fancy syntax isn’t necessary for summarizing, questioning, or translating a natural-language text. Today’s NLP relies more on brawn (computational power) than on brain (grammatical analysis). Mathematics — specifically, statistics — has overtaken logic, and machine learning (including, but not restricted to, deep learning) has displaced syntactic analysis. […] In modern-day NLP, powerful computers do statistical searches of huge collections (‘corpora’) of texts […] to find word patterns both commonplace and unexpected. […] In general […], the focus is on words and phrases, not syntax. […] Machine-matching of languages from different language groups is usually difficult. […] Human judgements of relevance are often […] much too subtle for today’s NLP. Indeed, relevance is a linguistic/conceptual version of the unforgiving ‘frame problem‘ in robotics […]. Many people would argue that it will never be wholly mastered by a non-human system.”

“[M]any AI research groups are now addressing emotion. Most (not quite all) of this research is theoretically shallow. And most is potentially lucrative, being aimed at developing ‘computer companions’. These are AI systems — some screen-based, some ambulatory robots — designed to interact with people in ways that (besides being practically helpful) are affectively comfortable, even satisfying, for the user. Most are aimed at the elderly and/or disabled, including people with incipient dementia. Some are targeted on babies or infants. Others are interactive ‘adult toys’. […] AI systems can already recognize human emotions in various ways. Some are physiological: monitoring the person’s breathing rate and galvanic skin response. Some are verbal: noting the speaker’s speed and intonation, as well as their vocabulary. And some are visual: analysing their facial expressions. At present, all these methods are relatively crude. The user’s emotions are both easily missed and easily misinterpreted. […] [An] point [point], here [in the development and evaluation of AI], is that emotions aren’t merely feelings. They involve functional, as well as phenomenal, consciousness […]. Specifically, they are computational mechanisms that enable us to schedule competing motives – and without which we couldn’t function. […] If we are ever to achieve AGI, emotions such as anxiety will have to be included – and used.”

[The point made in the book is better made in Aureli et al.‘s book, especially the last chapters to which the coverage in the linked post refer. The point is that emotions enable us to make better decisions, or perhaps even to make a decision in the first place; the emotions we feel in specific contexts will tend not to be even remotely random, rather they will tend to a significant extent to be Nature’s (…and Mr. Darwin’s) attempt to tell us how to handle a specific conflict of interest in the ‘best’ manner. You don’t need to do the math, your forebears did it for you, which is why you’re now …angry, worried, anxious, etc. If you had to do the math every time before you made a decision, you’d be in trouble, and emotions provide a great shortcut in many contexts. The potential for such short-cuts seems really important if you want an agent to act intelligently, regardless of whether said agent is ‘artificial’ or not. The book very briefly mentions a few of Minsky’s thoughts on these topics, and people who are curious could probably do worse than read some of his stuff. This book seems like a place to start.]

Links:

GOFAI (“Good Old-Fashioned Artificial Intelligence”).
Ada Lovelace. Charles Babbage. Alan Turing. Turing machine. Turing test. Norbert WienerJohn von Neumann. W. Ross Ashby. William Grey Walter. Oliver SelfridgeKenneth Craik. Gregory Bateson. Frank Rosenblatt. Marvin Minsky. Seymour Papert.
A logical calculus of the ideas immanent in nervous activity (McCulloch & Pitts, 1943).
Propositional logic. Logic gate.
Arthur Samuel’s checkers player. Logic Theorist. General Problem Solver. The Homeostat. Pandemonium architecture. Perceptron. Cyc.
Fault-tolerant computer system.
Cybernetics.
Programmed Data Processor (PDP).
Artificial life.
Forward chaining. Backward chaining.
Rule-based programming. MYCIN. Dendral.
Semantic network.
Non-monotonic logic. Fuzzy logic.
Facial recognition system. Computer vision.
Bayesian statistics.
Helmholtz machine.
DQN algorithm.
AlphaGo. AlphaZero.
Human Problem Solving (Newell & Simon, 1970).
ACT-R.
NELL (Never-Ending Language Learning).
SHRDLU.
ALPAC.
Google translate.
Data mining. Sentiment analysis. Siri. Watson (computer).
Paro (robot).
Uncanny valley.
CogAff architecture.
Connectionism.
Constraint satisfaction.
Content-addressable memory.
Graceful degradation.
Physical symbol system hypothesis.

January 10, 2019 Posted by | Biology, Books, Computer science, Engineering, Language, Mathematics, Papers, Psychology, Statistics | Leave a comment

A few diabetes papers of interest

i. Islet Long Noncoding RNAs: A Playbook for Discovery and Characterization.

“This review will 1) highlight what is known about lncRNAs in the context of diabetes, 2) summarize the strategies used in lncRNA discovery pipelines, and 3) discuss future directions and the potential impact of studying the role of lncRNAs in diabetes.”

“Decades of mouse research and advances in genome-wide association studies have identified several genetic drivers of monogenic syndromes of β-cell dysfunction, as well as 113 distinct type 2 diabetes (T2D) susceptibility loci (1) and ∼60 loci associated with an increased risk of developing type 1 diabetes (T1D) (2). Interestingly, these studies discovered that most T1D and T2D susceptibility loci fall outside of coding regions, which suggests a role for noncoding elements in the development of disease (3,4). Several studies have demonstrated that many causal variants of diabetes are significantly enriched in regions containing islet enhancers, promoters, and transcription factor binding sites (5,6); however, not all diabetes susceptibility loci can be explained by associations with these regulatory regions. […] Advances in RNA sequencing (RNA-seq) technologies have revealed that mammalian genomes encode tens of thousands of RNA transcripts that have similar features to mRNAs, yet are not translated into proteins (7). […] detailed characterization of many of these transcripts has challenged the idea that the central role for RNA in a cell is to give rise to proteins. Instead, these RNA transcripts make up a class of molecules called noncoding RNAs (ncRNAs) that function either as “housekeeping” ncRNAs, such as transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), that are expressed ubiquitously and are required for protein synthesis or as “regulatory” ncRNAs that control gene expression. While the functional mechanisms of short regulatory ncRNAs, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), and Piwi-interacting RNAs (piRNAs), have been described in detail (810), the most abundant and functionally enigmatic regulatory ncRNAs are called long noncoding RNAs (lncRNAs) that are loosely defined as RNAs larger than 200 nucleotides (nt) that do not encode for protein (1113). Although using a definition based strictly on size is somewhat arbitrary, this definition is useful both bioinformatically […] and technically […]. While the 200-nt size cutoff has simplified identification of lncRNAs, this rather broad classification means several features of lncRNAs, including abundance, cellular localization, stability, conservation, and function, are inherently heterogeneous (1517). Although this represents one of the major challenges of lncRNA biology, it also highlights the untapped potential of lncRNAs to provide a novel layer of gene regulation that influences islet physiology and pathophysiology.”

“Although the role of miRNAs in diabetes has been well established (9), analyses of lncRNAs in islets have lagged behind their short ncRNA counterparts. However, several recent studies provide evidence that lncRNAs are crucial components of the islet regulome and may have a role in diabetes (27). […] misexpression of several lncRNAs has been correlated with diabetes complications, such as diabetic nephropathy and retinopathy (2931). There are also preliminary studies suggesting that circulating lncRNAs, such as Gas5, MIAT1, and SENCR, may represent effective molecular biomarkers of diabetes and diabetes-related complications (32,33). Finally, several recent studies have explored the role of lncRNAs in the peripheral metabolic tissues that contribute to energy homeostasis […]. In addition to their potential as genetic drivers and/or biomarkers of diabetes and diabetes complications, lncRNAs can be exploited for the treatment of diabetes. For example, although tremendous efforts have been dedicated to generating replacement β-cells for individuals with diabetes (35,36), human pluripotent stem cell–based β-cell differentiation protocols remain inefficient, and the end product is still functionally and transcriptionally immature compared with primary human β-cells […]. This is largely due to our incomplete knowledge of in vivo differentiation regulatory pathways, which likely include a role for lncRNAs. […] Inherent characteristics of lncRNAs have also made them attractive candidates for drug targeting, which could be exploited for developing new diabetes therapies.”

“With the advancement of high-throughput sequencing techniques, the list of islet-specific lncRNAs is growing exponentially; however, functional characterization is missing for the majority of these lncRNAs. […] Tens of thousands of lncRNAs have been identified in different cell types and model organisms; however, their functions largely remain unknown. Although the tools for determining lncRNA function are technically restrictive, uncovering novel regulatory mechanisms will have the greatest impact on understanding islet function and identifying novel therapeutics for diabetes. To date, no biochemical assay has been used to directly determine the molecular mechanisms by which islet lncRNAs function, which highlights both the infancy of the field and the difficulty in implementing these techniques. […] Due to the infancy of the lncRNA field, most of the biochemical and genetic tools used to interrogate lncRNA function have only recently been developed or are adapted from techniques used to study protein-coding genes and we are only beginning to appreciate the limits and challenges of borrowing strategies from the protein-coding world.”

“The discovery of lncRNAs as a novel class of tissue-specific regulatory molecules has spawned an exciting new field of biology that will significantly impact our understanding of pancreas physiology and pathophysiology. As the field continues to grow, there is growing appreciation that lncRNAs will provide many of the missing components to existing molecular pathways that regulate islet biology and contribute to diabetes when they become dysfunctional. However, to date, most of the experimental emphasis on lncRNAs has focused on large-scale discovery using genome-wide approaches, and there remains a paucity of functional analysis.”

ii. Diabetes and Trajectories of Estimated Glomerular Filtration Rate: A Prospective Cohort Analysis of the Atherosclerosis Risk in Communities Study.

“Diabetes is among the strongest common risk factors for end-stage renal disease, and in industrialized countries, diabetes contributes to ∼50% of cases (3). Less is known about the pattern of kidney function decline associated with diabetes that precedes end-stage renal disease. Identifying patterns of estimated glomerular filtration rate (eGFR) decline could inform monitoring practices for people at high risk of chronic kidney disease (CKD) progression. A better understanding of when and in whom eGFR decline occurs would be useful for the design of clinical trials because eGFR decline >30% is now often used as a surrogate end point for CKD progression (4). Trajectories among persons with diabetes are of particular interest because of the possibility for early intervention and the prevention of CKD development. However, eGFR trajectories among persons with new diabetes may be complex due to the hypothesized period of hyperfiltration by which GFR increases, followed by progressive, rapid decline (5). Using data from the Atherosclerosis Risk in Communities (ARIC) study, an ongoing prospective community-based cohort of >15,000 participants initiated in 1987 with serial measurements of creatinine over 26 years, our aim was to characterize patterns of eGFR decline associated with diabetes, identify demographic, genetic, and modifiable risk factors within the population with diabetes that were associated with steeper eGFR decline, and assess for evidence of early hyperfiltration.”

“We categorized people into groups of no diabetes, undiagnosed diabetes, and diagnosed diabetes at baseline (visit 1) and compared baseline clinical characteristics using ANOVA for continuous variables and Pearson χ2 tests for categorical variables. […] To estimate individual eGFR slopes over time, we used linear mixed-effects models with random intercepts and random slopes. These models were fit on diabetes status at baseline as a nominal variable to adjust the baseline level of eGFR and included an interaction term between diabetes status at baseline and time to estimate annual decline in eGFR by diabetes categories. Linear mixed models were run unadjusted and adjusted, with the latter model including the following diabetes and kidney disease–related risk factors: age, sex, race–center, BMI, systolic blood pressure, hypertension medication use, HDL, prevalent coronary heart disease, annual family income, education status, and smoking status, as well as each variable interacted with time. Continuous covariates were centered at the analytic population mean. We tested model assumptions and considered different covariance structures, comparing nested models using Akaike information criteria. We identified the unstructured covariance model as the most optimal and conservative approach. From the mixed models, we described the overall mean annual decline by diabetes status at baseline and used the random effects to estimate best linear unbiased predictions to describe the distributions of yearly slopes in eGFR by diabetes status at baseline and displayed them using kernel density plots.”

“Because of substantial variation in annual eGFR slope among people with diagnosed diabetes, we sought to identify risk factors that were associated with faster decline. Among those with diagnosed diabetes, we compared unadjusted and adjusted mean annual decline in eGFR by race–APOL1 risk status (white, black– APOL1 low risk, and black–APOL1 high risk) [here’s a relevant link, US], systolic blood pressure […], smoking status […], prevalent coronary heart disease […], diabetes medication use […], HbA1c […], and 1,5-anhydroglucitol (≥10 and <10 μg/mL) [relevant link, US]. Because some of these variables were only available at visit 2, we required that participants included in this subgroup analysis attend both visits 1 and 2 and not be missing information on APOL1 or the variables assessed at visit 2 to ensure a consistent sample size. In addition to diabetes and kidney disease–related risk factors in the adjusted model, we also included diabetes medication use and HbA1c to account for diabetes severity in these analyses. […] to explore potential hyperfiltration, we used a linear spline model to allow the slope to change for each diabetes category between the first 3 years of follow-up (visit 1 to visit 2) and the subsequent time period (visit 2 to visit 5).”

“There were 15,517 participants included in the analysis: 13,698 (88%) without diabetes, 634 (4%) with undiagnosed diabetes, and 1,185 (8%) with diagnosed diabetes at baseline. […] At baseline, participants with undiagnosed and diagnosed diabetes were older, more likely to be black or have hypertension and coronary heart disease, and had higher mean BMI and lower mean HDL compared with those without diabetes […]. Income and education levels were also lower among those with undiagnosed and diagnosed diabetes compared with those without diabetes. […] Overall, there was a nearly linear association between eGFR and age over time, regardless of diabetes status […]. The crude mean annual decline in eGFR was slowest among those without diabetes at baseline (decline of −1.6 mL/min/1.73 m2/year [95% CI −1.6 to −1.5]), faster among those with undiagnosed diabetes compared with those without diabetes (decline of −2.1 mL/min/1.73 m2/year [95% CI −2.2 to −2.0][…]), and nearly twice as rapid among those with diagnosed diabetes compared with those without diabetes (decline of −2.9 mL/min/1.73 m2/year [95% CI −3.0 to −2.8][…]). Adjustment for diabetes and kidney disease–related risk factors attenuated the results slightly, but those with undiagnosed and diagnosed diabetes still had statistically significantly steeper declines than those without diabetes (decline among no diabetes −1.4 mL/min/1.73 m2/year [95% CI −1.5 to −1.4] and decline among undiagnosed diabetes −1.8 mL/min/1.73 m2/year [95% CI −2.0 to −1.7], difference vs. no diabetes of −0.4 mL/min/1.73 m2/year [95% CI −0.5 to −0.3; P < 0.001]; decline among diagnosed diabetes −2.5 mL/min/1.73 m2/year [95% CI −2.6 to −2.4], difference vs. no diabetes of −1.1 mL/min/1.73 m2/ year [95% CI −1.2 to −1.0; P < 0.001]). […] The decline in eGFR per year varied greatly across individuals, particularly among those with diabetes at baseline […] Among participants with diagnosed diabetes at baseline, those who were black, had systolic blood pressure ≥140 mmHg, used diabetes medications, had an HbA1c ≥7% [≥53 mmol/mol], or had 1,5-anhydroglucitol <10 μg/mL were at risk for steeper annual declines than their counterparts […]. Smoking status and prevalent coronary heart disease were not associated with significantly steeper eGFR decline in unadjusted analyses. Adjustment for risk factors, diabetes medication use, and HbA1c attenuated the differences in decline for all subgroups with the exception of smoking status, leaving black race along with APOL1-susceptible genotype, systolic blood pressure ≥140 mmHg, current smoking, insulin use, and HbA1c ≥9% [≥75 mmol/mol] as the risk factors indicative of steeper decline.”

CONCLUSIONS Diabetes is an important risk factor for kidney function decline. Those with diagnosed diabetes declined almost twice as rapidly as those without diabetes. Among people with diagnosed diabetes, steeper declines were seen in those with modifiable risk factors, including hypertension and glycemic control, suggesting areas for continued targeting in kidney disease prevention. […] Few other community-based studies have evaluated differences in kidney function decline by diabetes status over a long period through mid- and late life. One study of 10,184 Canadians aged ≥66 years with creatinine measured during outpatient visits showed results largely consistent with our findings but with much shorter follow-up (median of 2 years) (19). Other studies of eGFR change in a general population have found smaller declines than our results (20,21). A study conducted in Japanese participants aged 40–79 years found a decline of only −0.4 mL/min/1.73 m2/year over the course of two assessments 10 years apart (compared with our estimate among those without diabetes: −1.6 mL/min/1.73 m2/year). This is particularly interesting, as Japan is known to have a higher prevalence of CKD and end-stage renal disease than the U.S. (20). However, this study evaluated participants over a shorter time frame and required attendance at both assessments, which may have decreased the likelihood of capturing severe cases and resulted in underestimation of decline.”

“The Baltimore Longitudinal Study of Aging also assessed kidney function over time in a general population of 446 men, ranging in age from 22 to 97 years at baseline, each with up to 14 measurements of creatinine clearance assessed between 1958 and 1981 (21). They also found a smaller decline than we did (−0.8 mL/min/year), although this study also had notable differences. Their main analysis excluded participants with hypertension and history of renal disease or urinary tract infection and those treated with diuretics and/or antihypertensive medications. Without those exclusions, their overall estimate was −1.1 mL/min/year, which better reflects a community-based population and our results. […] In our evaluation of risk factors that might explain the variation in decline seen among those with diagnosed diabetes, we observed that black race, systolic blood pressure ≥140 mmHg, insulin use, and HbA1c ≥9% (≥75 mmol/mol) were particularly important. Although the APOL1 high-risk genotype is a known risk factor for eGFR decline, African Americans with low-risk APOL1 status continued to be at higher risk than whites even after adjustment for traditional risk factors, diabetes medication use, and HbA1c.”

“Our results are relevant to the design and conduct of clinical trials. Hard clinical outcomes like end-stage renal disease are relatively rare, and a 30–40% decline in eGFR is now accepted as a surrogate end point for CKD progression (4). We provide data on patient subgroups that may experience accelerated trajectories of kidney function decline, which has implications for estimating sample size and ensuring adequate power in future clinical trials. Our results also suggest that end points of eGFR decline might not be appropriate for patients with new-onset diabetes, in whom declines may actually be slower than among persons without diabetes. Slower eGFR decline among those with undiagnosed diabetes, who are likely early in the course of diabetes, is consistent with the hypothesis of hyperfiltration. Similar to other studies, we found that persons with undiagnosed diabetes had higher GFR at the outset, but this was a transient phenomenon, as they ultimately experienced larger declines in kidney function than those without diabetes over the course of follow-up (2325). Whether hyperfiltration is a universal aspect of early disease and, if not, whether it portends worse long-term outcomes is uncertain. Existing studies investigating hyperfiltration as a precursor to adverse kidney outcomes are inconsistent (24,26,27) and often confounded by diabetes severity factors like duration (27). We extended this literature by separating undiagnosed and diagnosed diabetes to help address that confounding.”

iii. Saturated Fat Is More Metabolically Harmful for the Human Liver Than Unsaturated Fat or Simple Sugars.

OBJECTIVE Nonalcoholic fatty liver disease (i.e., increased intrahepatic triglyceride [IHTG] content), predisposes to type 2 diabetes and cardiovascular disease. Adipose tissue lipolysis and hepatic de novo lipogenesis (DNL) are the main pathways contributing to IHTG. We hypothesized that dietary macronutrient composition influences the pathways, mediators, and magnitude of weight gain-induced changes in IHTG.

RESEARCH DESIGN AND METHODS We overfed 38 overweight subjects (age 48 ± 2 years, BMI 31 ± 1 kg/m2, liver fat 4.7 ± 0.9%) 1,000 extra kcal/day of saturated (SAT) or unsaturated (UNSAT) fat or simple sugars (CARB) for 3 weeks. We measured IHTG (1H-MRS), pathways contributing to IHTG (lipolysis ([2H5]glycerol) and DNL (2H2O) basally and during euglycemic hyperinsulinemia), insulin resistance, endotoxemia, plasma ceramides, and adipose tissue gene expression at 0 and 3 weeks.

RESULTS Overfeeding SAT increased IHTG more (+55%) than UNSAT (+15%, P < 0.05). CARB increased IHTG (+33%) by stimulating DNL (+98%). SAT significantly increased while UNSAT decreased lipolysis. SAT induced insulin resistance and endotoxemia and significantly increased multiple plasma ceramides. The diets had distinct effects on adipose tissue gene expression.”

CONCLUSIONS NAFLD has been shown to predict type 2 diabetes and cardiovascular disease in multiple studies, even independent of obesity (1), and also to increase the risk of progressive liver disease (17). It is therefore interesting to compare effects of different diets on liver fat content and understand the underlying mechanisms. We examined whether provision of excess calories as saturated (SAT) or unsaturated (UNSAT) fats or simple sugars (CARB) influences the metabolic response to overfeeding in overweight subjects. All overfeeding diets increased IHTGs. The SAT diet induced a greater increase in IHTGs than the UNSAT diet. The composition of the diet altered sources of excess IHTGs. The SAT diet increased lipolysis, whereas the CARB diet stimulated DNL. The SAT but not the other diets increased multiple plasma ceramides, which increase the risk of cardiovascular disease independent of LDL cholesterol (18). […] Consistent with current dietary recommendations (3638), the current study shows that saturated fat is the most harmful dietary constituent regarding IHTG accumulation.”

iv. Primum Non Nocere: Refocusing Our Attention on Severe Hypoglycemia Prevention.

“Severe hypoglycemia, defined as low blood glucose requiring assistance for recovery, is arguably the most dangerous complication of type 1 diabetes as it can result in permanent cognitive impairment, seizure, coma, accidents, and death (1,2). Since the Diabetes Control and Complications Trial (DCCT) demonstrated that intensive intervention to normalize glucose prevents long-term complications but at the price of a threefold increase in the rate of severe hypoglycemia (3), hypoglycemia has been recognized as the major limitation to achieving tight glycemic control. Severe hypoglycemia remains prevalent among adults with type 1 diabetes, ranging from ∼1.4% per year in the DCCT/EDIC (Epidemiology of Diabetes Interventions and Complications) follow-up cohort (4) to ∼8% in the T1D Exchange clinic registry (5).

One the greatest risk factors for severe hypoglycemia is impaired awareness of hypoglycemia (6), which increases risk up to sixfold (7,8). Hypoglycemia unawareness results from deficient counterregulation (9), where falling glucose fails to activate the autonomic nervous system to produce neuroglycopenic symptoms that normally help patients identify and respond to episodes (i.e., sweating, palpitations, hunger) (2). An estimated 20–25% of adults with type 1 diabetes have impaired hypoglycemia awareness (8), which increases to more than 50% after 25 years of disease duration (10).

Screening for hypoglycemia unawareness to identify patients at increased risk of severe hypoglycemic events should be part of routine diabetes care. Self-identified impairment in awareness tends to agree with clinical evaluation (11). Therefore, hypoglycemia unawareness can be easily and effectively screened […] Interventions for hypoglycemia unawareness include a range of behavioral and medical options. Avoiding hypoglycemia for at least several weeks may partially reverse hypoglycemia unawareness and reduce risk of future episodes (1). Therefore, patients with hypoglycemia and unawareness may be advised to raise their glycemic and HbA1c targets (1,2). Diabetes technology can play a role, including continuous subcutaneous insulin infusion (CSII) to optimize insulin delivery, continuous glucose monitoring (CGM) to give technological awareness in the absence of symptoms (14), or the combination of the two […] Aside from medical management, structured or hypoglycemia-specific education programs that aim to prevent hypoglycemia are recommended for all patients with severe hypoglycemia or hypoglycemia unawareness (14). In randomized trials, psychoeducational programs that incorporate increased education, identification of personal risk factors, and behavior change support have improved hypoglycemia unawareness and reduced the incidence of both nonsevere and severe hypoglycemia over short periods of follow-up (17,18) and extending up to 1 year (19).”

“Given that the presence of hypoglycemia unawareness increases the risk of severe hypoglycemia, which is the strongest predictor of a future episode (2,4), the implication that intervention can break the life-threatening and traumatizing cycle of hypoglycemia unawareness and severe hypoglycemia cannot be overstated. […] new evidence of durability of effect across treatment regimen without increasing the risk for long-term complications creates an imperative for action. In combination with existing screening tools and a body of literature investigating novel interventions for hypoglycemia unawareness, these results make the approach of screening, recognition, and intervention very compelling as not only a best practice but something that should be incorporated in universal guidelines on diabetes care, particularly for individuals with type 1 diabetes […] Hyperglycemia is […] only part of the puzzle in diabetes management. Long-term complications are decreasing across the population with improved interventions and their implementation (24). […] it is essential to shift our historical obsession with hyperglycemia and its long-term complications to equally emphasize the disabling, distressing, and potentially fatal near-term complication of our treatments, namely severe hypoglycemia. […] The health care providers’ first dictum is primum non nocere — above all, do no harm. ADA must refocus our attention on severe hypoglycemia as an iatrogenic and preventable complication of our interventions.”

v. Anti‐vascular endothelial growth factor combined with intravitreal steroids for diabetic macular oedema.

“Background

The combination of steroid and anti‐vascular endothelial growth factor (VEGF) intravitreal therapeutic agents could potentially have synergistic effects for treating diabetic macular oedema (DMO). On the one hand, if combined treatment is more effective than monotherapy, there would be significant implications for improving patient outcomes. Conversely, if there is no added benefit of combination therapy, then people could be potentially exposed to unnecessary local or systemic side effects.

Objectives

To assess the effects of intravitreal agents that block vascular endothelial growth factor activity (anti‐VEGF agents) plus intravitreal steroids versus monotherapy with macular laser, intravitreal steroids or intravitreal anti‐VEGF agents for managing DMO.”

“There were eight RCTs (703 participants, 817 eyes) that met our inclusion criteria with only three studies reporting outcomes at one year. The studies took place in Iran (3), USA (2), Brazil (1), Czech Republic (1) and South Korea (1). […] When comparing anti‐VEGF/steroid with anti‐VEGF monotherapy as primary therapy for DMO, we found no meaningful clinical difference in change in BCVA [best corrected visual acuity] […] or change in CMT [central macular thickness] […] at one year. […] There was very low‐certainty evidence on intraocular inflammation from 8 studies, with one event in the anti‐VEGF/steroid group (313 eyes) and two events in the anti‐VEGF group (322 eyes). There was a greater risk of raised IOP (Peto odds ratio (OR) 8.13, 95% CI 4.67 to 14.16; 635 eyes; 8 RCTs; moderate‐certainty evidence) and development of cataract (Peto OR 7.49, 95% CI 2.87 to 19.60; 635 eyes; 8 RCTs; moderate‐certainty evidence) in eyes receiving anti‐VEGF/steroid compared with anti‐VEGF monotherapy. There was low‐certainty evidence from one study of an increased risk of systemic adverse events in the anti‐VEGF/steroid group compared with the anti‐VEGF alone group (Peto OR 1.32, 95% CI 0.61 to 2.86; 103 eyes).”

“One study compared anti‐VEGF/steroid versus macular laser therapy. At one year investigators did not report a meaningful difference between the groups in change in BCVA […] or change in CMT […]. There was very low‐certainty evidence suggesting an increased risk of cataract in the anti‐VEGF/steroid group compared with the macular laser group (Peto OR 4.58, 95% 0.99 to 21.10, 100 eyes) and an increased risk of elevated IOP in the anti‐VEGF/steroid group compared with the macular laser group (Peto OR 9.49, 95% CI 2.86 to 31.51; 100 eyes).”

“Authors’ conclusions

Combination of intravitreal anti‐VEGF plus intravitreal steroids does not appear to offer additional visual benefit compared with monotherapy for DMO; at present the evidence for this is of low‐certainty. There was an increased rate of cataract development and raised intraocular pressure in eyes treated with anti‐VEGF plus steroid versus anti‐VEGF alone. Patients were exposed to potential side effects of both these agents without reported additional benefit.”

vi. Association between diabetic foot ulcer and diabetic retinopathy.

“More than 25 million people in the United States are estimated to have diabetes mellitus (DM), and 15–25% will develop a diabetic foot ulcer (DFU) during their lifetime [1]. DFU is one of the most serious and disabling complications of DM, resulting in significantly elevated morbidity and mortality. Vascular insufficiency and associated neuropathy are important predisposing factors for DFU, and DFU is the most common cause of non-traumatic foot amputation worldwide. Up to 70% of all lower leg amputations are performed on patients with DM, and up to 85% of all amputations are preceded by a DFU [2, 3]. Every year, approximately 2–3% of all diabetic patients develop a foot ulcer, and many require prolonged hospitalization for the treatment of ensuing complications such as infection and gangrene [4, 5].

Meanwhile, a number of studies have noted that diabetic retinopathy (DR) is associated with diabetic neuropathy and microvascular complications [610]. Despite the magnitude of the impact of DFUs and their consequences, little research has been performed to investigate the characteristics of patients with a DFU and DR. […] the aim of this study was to investigate the prevalence of DR in patients with a DFU and to elucidate the potential association between DR and DFUs.”

“A retrospective review was conducted on DFU patients who underwent ophthalmic and vascular examinations within 6 months; 100 type 2 diabetic patients with DFU were included. The medical records of 2496 type 2 diabetic patients without DFU served as control data. DR prevalence and severity were assessed in DFU patients. DFU patients were compared with the control group regarding each clinical variable. Additionally, DFU patients were divided into two groups according to DR severity and compared. […] Out of 100 DFU patients, 90 patients (90%) had DR and 55 (55%) had proliferative DR (PDR). There was no significant association between DR and DFU severities (R = 0.034, p = 0.734). A multivariable analysis comparing type 2 diabetic patients with and without DFUs showed that the presence of DR [OR, 226.12; 95% confidence interval (CI), 58.07–880.49; p < 0.001] and proliferative DR [OR, 306.27; 95% CI, 64.35–1457.80; p < 0.001), higher HbA1c (%, OR, 1.97, 95% CI, 1.46–2.67; p < 0.001), higher serum creatinine (mg/dL, OR, 1.62, 95% CI, 1.06–2.50; p = 0.027), older age (years, OR, 1.12; 95% CI, 1.06–1.17; p < 0.001), higher pulse pressure (mmHg, OR, 1.03; 95% CI, 1.00–1.06; p = 0.025), lower cholesterol (mg/dL, OR, 0.94; 95% CI, 0.92–0.97; p < 0.001), lower BMI (kg/m2, OR, 0.87, 95% CI, 0.75–1.00; p = 0.044) and lower hematocrit (%, OR, 0.80, 95% CI, 0.74–0.87; p < 0.001) were associated with DFUs. In a subgroup analysis of DFU patients, the PDR group had a longer duration of diabetes mellitus, higher serum BUN, and higher serum creatinine than the non-PDR group. In the multivariable analysis, only higher serum creatinine was associated with PDR in DFU patients (OR, 1.37; 95% CI, 1.05–1.78; p = 0.021).

Conclusions

Diabetic retinopathy is prevalent in patients with DFU and about half of DFU patients had PDR. No significant association was found in terms of the severity of these two diabetic complications. To prevent blindness, patients with DFU, and especially those with high serum creatinine, should undergo retinal examinations for timely PDR diagnosis and management.”

August 29, 2018 Posted by | Diabetes, Epidemiology, Genetics, Medicine, Molecular biology, Nephrology, Ophthalmology, Statistics, Studies | Leave a comment

Combinatorics (II)

I really liked this book. Below I have added some links and quotes related to the second half of the book’s coverage.

“An n × n magic square, or a magic square of order n, is a square array of numbers — usually (but not necessarily) the numbers from 1 to n2 — arranged in such a way that the sum of the numbers in each of the n rows, each of the n columns, or each of the two main diagonals is the same. A semi-magic square is a square array in which the sum of the numbers in each row or column, but not necessarily the diagonals, is the same. We note that if the entries are 1 to n2, then the sum of the numbers in the whole array is
1 + 2 + 3 + … + n2n2 (n2 + 1) / 2
on summing the arithmetic progression. Because the n rows and columns have the same ‘magic sum’, the numbers in each single row or column add up to (1/n)th of this, which is n (n2+1) / 2 […] An nn latin squareor a latin square of order n, is a square array with n symbols arranged so that each symbol appears just once in each row and column. […] Given a latin square, we can obtain others by rearranging the rows or the columns, or by permuting the symbols. For an n × n latin square with symbols 1, 2, … , n, we can thereby arrange that the numbers in the first row and the first column appear in order as 1, 2, … , n. Such a latin square is called normalized […] A familiar form of latin square is the sudoku puzzle […] How many n x n latin squares are there for a given order of n? The answer is known only for n ≤ 11. […] The number of normalized latin squares of order 11 has an impressive forty-eight digits.”

“A particular type of latin square is the cyclic square, where the symbols appear in the same cyclic order, moving one place to the left in each successive row, so that the entry at the beginning of each line appears at the end of the next one […] An extension of this idea is where the symbols move more places to the left in each successive row […] We can construct a latin square row by row from its first row, always taking care that no symbol appears twice in any column. […] An important concept […] is that of a set of orthogonal latin squares […] two n × n latin squares are orthogonal if, when superimposed, each of the n2 possible pairings of a symbol from each square appears exactly once. […] pairs of orthogonal latin squares are […] used in agricultural experiments. […] We can extend the idea of orthogonality beyond pairs […] A set of mutually orthogonal latin squares (sometimes abbreviated to MOLS) is a set of latin squares, any two of which are orthogonal […] Note that there can be at most n-1 MOLS of order n. […] A full set of MOLS is called a complete set […] We can ask the following question: For which values of n does there exist a complete set of n × n mutually orthogonal latin squares? As several authors have shown, a complete set exists whenever n is a prime number (other than 2) or a power of a prime […] In 1922, H. L. MacNeish generalized this result by observing that if n has prime factorization p, then the number of MOLS is at least min (p1a x p2b, … , pkz) – 1″.

“Consider the following [problem] involving comparisons between a number of varieties of a commodity: A consumer organization wishes to compare seven brands of detergent and arranges a number of tests. But since it may be uneconomic or inconvenient for each tester to compare all seven brands it is decided that each tester should compare just three brands. How should the trials be organized if each brand is to be tested the same number of times and each pair of brands is to be compared directly? […] A block design consists of a set of v varieties arranged into b blocks. […] [if we] further assume that each block contains the same number k of varieties, and each variety appears in the same number r of blocks […] [the design is] called [an] equireplicate design […] for every block design we have v x r = b x k. […] It would clearly be preferable if all pairs of varieties in a design were compared the same number of times […]. Such a design is called balanced, or a balanced incomplete-block design (often abbreviated to BIBD). The number of times that any two varieties are compared is usually denoted by λ […] In a balanced block design the parameters v, b, k, r, and λ are not independent […] [Rather it is the case that:] r x (k -1) = λ x (v – 1). […] The conditions v x r = b x k and r x (k -1) = λ x (v – 1) are both necessary for a design to be balanced, but they’re not sufficient since there are designs satisfying both conditions which are not balanced. Another necessary condition for a design to be balanced is v ≤ b, a result known as Fisher’s inequality […] A balanced design for which v = b, and therefore k = r, is called a symmetric design“.

“A block design with v varieties is resolvable if its blocks can be rearranged into subdesigns, called replicates, each of which contains every variety just once. [….] we define a finite projective plane to be an arrangement of a finite number of points and a finite number of lines with the properties that: [i] Any two points lie on exactly one line. [ii] Any two lines pass through exactly one point.
Note that this differs from our usual Euclidean geometry, where any two lines pass through exactly one point unless they’re parallel. Omitting these italicized words produces a completely different type of geometry from the one we’re used to, since there’s now a ‘duality’ or symmetry between points and lines, according to which any statement about points lying on lines gives rise to a statement about lines passing through points, and vice versa. […] We say that the finite projective plane has order n if each line contains n + 1 points. […] removing a single line from a projective plane of order n, and the n + 1 points on this line, gives a square pattern with n2 points and n2 + n lines where each line contains n points and each point lies on n + 1 lines. Such a diagram is called an affine plane of order n. […] This process is reversible. If we start with an affine plane of order n and add another line joined up appropriately, we get a projective plane of order n. […] Every finite projective plane gives rise to a symmetric balanced design. […] In general, a finite projective plane of order n, with n2 + n + 1 points and lines and with n + 1 points on each line and n + 1 lines through each point, gives rise to a balanced symmetric design with parameters v = b = n2 + n + 1, k = r = n + 1, and λ = 1. […] Every finite affine plane gives rise to a resolvable design. […] In general, an affine plane of order n, obtained by removing a line and n + 1 points from a projective plane of order n, gives rise to a resolvable design with parameters v = n2 , b = n2 + n , k = n , and r = n + 1. […] Every finite affine plane corresponds to a complete set of orthogonal latin squares.”

Links:

Regular polygon.
Polyhedron.
Internal and external angles.
Triangular tiling. Square tiling. Hexagonal tiling.
Semiregular tessellations.
Penrose tiling.
Platonic solid.
Euler’s polyhedron formula.
Prism (geometry). Antiprism.
Fullerene.
Geodesic dome.
Graph theory.
Complete graph. Complete bipartite graph. Cycle graph.
Degree (graph theory).
Handshaking lemma.
Ramsey theory.
Tree (graph theory).
Eulerian and Hamiltonian Graphs. Hamiltonian path.
Icosian game.
Knight’s tour problem.
Planar graph. Euler’s formula for plane graphs.
Kuratowski’s theorem.
Dual graph.
Lo Shu Square.
Melencolia I.
Euler’s Thirty-six officers problem.
Steiner triple system.
Partition (number theory).
Pentagonal number. Pentagonal number theorem.
Ramanujan’s congruences.

August 23, 2018 Posted by | Books, Mathematics, Statistics | Leave a comment

Some observations on a cryptographic problem

It’s been a long time since I last posted one of these sort of ‘rootless’ posts which are not based on a specific book or a specific lecture or something along those lines, but a question on r/science made me think about these topics and start writing a bit about it, and I decided I might as well add my thoughts and ideas here.

The reddit question which motivated me to write this post was this one: “Is it difficult to determine the password for an encryption if you are given both the encrypted and unencrypted message?

By “difficult” I mean requiring an inordinate amount of computation. If given both an encrypted and unencrypted file/message, is it reasonable to be able to recover the password that was used to encrypt the file/message?”

Judging from the way the question is worded, the inquirer obviously knows very little about these topics, but that was part of what motivated me when I started out writing; s/he quite obviously has a faulty model of how this kind of stuff actually works, and just by virtue of the way he or she asks his/her question s/he illustrates some ways in which s/he gets things wrong.

When I decided to transfer my efforts towards discussing these topics to the blog I also implicitly decided against using language that would be expected to be easily comprehensible for the original inquirer, as s/he was no longer in the target group and there’s a cost to using that kind of language when discussing technical matters. I’ve sort of tried to make this post both useful and readable to people not all that familiar with the related fields, but I tend to find it difficult to evaluate the extent to which I’ve succeeded when I try to do things like that.

I decided against adding stuff already commented on when I started out writing this, so I’ll not e.g. repeat noiwontfixyourpc’s reply below. However I have added some other observations that seem to me to be relevant and worth mentioning to people who might consider asking a similar question to the one the original inquirer asked in that thread:

i. Finding a way to make plaintext turn into cipher text (…or cipher text into plaintext; and no, these two things are not actually always equivalent, see below…) is a very different (and in many contexts a much easier problem) than finding out the actual encryption scheme that is at work producing the text strings you observe. There can be many, many different ways to go from a specific sample of plaintext to a specific sample of ciphertext, and most of the solutions won’t work if you’re faced with a new piece of ciphertext; especially not if the original samples are small, so only a small amount of (potential) information would be expected to be included in the text strings.

If you only get a small amount of plaintext and corresponding cipher text you may decide that algorithm A is the one that was applied to the message, even if the algorithm actually applied was a more complex algorithm, B. To illustrate in a very simple way how this might happen, A might be a particular case of B, because B is a superset of A and a large number of other potential encryption algorithms applied in the encryption scheme B (…or the encryption scheme C, because B also happens to be a subset of C, or… etc.). In such a context A might be an encryption scheme/approach that perhaps only applies in very specific contexts; for example (part of) the coding algorithm might have been to decide that ‘on next Tuesday, we’ll use this specific algorithm to translate plaintext into cipher text, and we’ll never use that specific translation-/mapping algorithm (which may be but one component of the encryption algorithm) again’. If such a situation applies then you’re faced with the problem that even if your rule ‘worked’ in that particular instance, in terms of translating your plaintext into cipher text and vice versa, it only ‘worked’ because you blindly fitted the two data-sets in a way that looked right, even if you actually had no idea how the coding scheme really worked (you only guessed A, not B, and in this particular instance A’s never actually going to happen again).

On a more general level some of the above comments incidentally in my view quite obviously links to results from classical statistics; there are many ways to link random variables through data fitting methods, but reliably identifying proper causal linkages through the application of such approaches is, well, difficult (and, according to some, often ill-advised)…

ii. In my view, it does not seem possible in general to prove that any specific proposed encryption/decryption algorithm is ‘the correct one’. This is because the proposed algorithm will never be a unique solution to the problem you’re evaluating. How are you going to convince me that The True Algorithm is not a more general/complex one (or perhaps a completely different one – see iii. below) than the one you propose, and that your solution is not missing relevant variables? The only way to truly test if the proposed algorithm is a valid algorithm is to test it on new data and compare its performance on this new data set with the performances of competing variables/solution proposals which also managed to correctly link cipher text and plaintext. If the algorithm doesn’t work on the new data, you got it wrong. If it does work on new data, well, you might still just have been lucky. You might get more confident with more correctly-assessed (…guessed?) data, but you never get certain. In other similar contexts a not uncommon approach for trying to get around these sorts of problems is to limit the analysis to a subset of the data available in order to obtain the algorithm, and then using the rest of the data for validation purposes (here’s a relevant link), but here even with highly efficient estimation approaches you almost certainly will run out of information (/degrees of freedom) long before you get anywhere if the encryption algorithm is at all non-trivial. In these settings information is likely to be a limiting resource.

iii. There are many different types of encryption schemes, and people who ask questions like the one above tend, I believe, to have a quite limited view of which methods and approaches are truly available to one who desires secrecy when exchanging information with others. Imagine a situation where the plaintext is ‘See you next Wednesday’ and the encrypted text is an English translation of Tolstoy’s book War and Peace (or, to make it even more fun, all pages published on the English version of Wikipedia, say on November the 5th, 2017 at midnight GMT). That’s an available encryption approach that might be applied. It might be a part (‘A’) of a more general (‘B’) encryption approach of linking specific messages from a preconceived list of messages, which had been considered worth sending in the future when the algorithm was chosen, to specific book titles decided on in advance. So if you want to say ‘good Sunday!’, Eve gets to read the Bible and see where that gets her. You could also decide that in half of all cases the book cipher text links to specific messages from a list but in the other half of the cases what you actually mean to communicate is on page 21 of the book; this might throw a hacker who saw a combined cipher text and plaintext combination resulting from that part of the algorithm off in terms of the other half, and vice versa – and it illustrates well one of the key problems you’re faced with as an attacker when working on cryptographic schemes about which you have limited knowledge; the opponent can always add new layers on top of the ones that already exist/apply to make the problem harder to solve. And so you could also link the specific list message with some really complicated cipher-encrypted version of the Bible. There’s a lot more to encryption schemes than just exchanging a few letters here and there. On related topics, see this link. On a different if related topic, people who desire secrecy when exchanging information may also attempt to try to hide the fact that any secrets are exchanged in the first place. See also this.

iv. The specific usage of the word ‘password’ in the original query calls for comment for multiple reasons, some of which have been touched upon above, perhaps mainly because it implicitly betrays a lack of knowledge about how modern cryptographic systems actually work. The thing is, even if you might consider an encryption scheme to just be an advanced sort of ‘password’, finding the password (singular) is not always the task you’re faced with today. In symmetric-key algorithm settings you might sort-of-kind-of argue that it sort-of is – in such settings you might say that you have one single (collection of) key(s) which you use to encrypt messages and also use to decrypt the messages. So you can both encrypt and decrypt the message using the same key(s), and so you only have one ‘password’. That’s however not how asymmetric-key encryption works. As wiki puts it: “In an asymmetric key encryption scheme, anyone can encrypt messages using the public key, but only the holder of the paired private key can decrypt.”

This of course relates to what you actually want to do/achieve when you get your samples of cipher text and plaintext. In some cryptographic contexts by design the route you need to to go to get from cipher text to plaintext is conceptually different from the route you need to go to get from plaintext to cipher text. And some of the ‘passwords’ that relate to how the schemes work are public knowledge by design.

v. I have already touched a bit upon the problem of the existence of an information constraint, but I realized I probably need to spell this out in a bit more detail. The original inquirer to me seems implicitly to be under the misapprehension that computational complexity is the only limiting constraint here (“By “difficult” I mean requiring an inordinate amount of computation.”). Given the setting he or she proposes, I don’t think that’s true, and why that is is sort of interesting.

If you think about what kind of problem you’re facing, what you have here in this setting is really a very limited amount of data which relates in an unknown manner to an unknown data-generating process (‘algorithm’). There are, as has been touched upon, in general many ways to obtain linkage between two data sets (the cipher text and the plaintext) using an algorithm – too many ways for comfort, actually. The search space is large, there are too many algorithms to consider; or equivalently, the amount of information supplied by the data will often be too small for us to properly evaluate the algorithms under consideration. An important observation is that more complex algorithms will both take longer to calculate (‘identify’ …at least as candidates) and be expected to require more data to evaluate, at least to the extent that algorithmic complexity constrains the data (/relates to changes in data structure/composition that needs to be modeled in order to evaluate/identify the goal algorithm). If the algorithm says a different encryption rule is at work on Wednesdays, you’re going to have trouble figuring that out if you only got hold of a cipher text/plaintext combination derived from an exchange which took place on a Saturday. There are methods from statistics that might conceivably help you deal with problems like these, but they have their own issues and trade-offs. You might limit yourself to considering only settings where you have access to all known plaintext and cipher text combinations, so you got both Wednesday and Saturday, but even here you can’t be safe – next (metaphorical, I probably at this point need to add) Friday might be different from last (metaphorical) Friday, and this could even be baked into the algorithm in very non-obvious ways.

The above remarks might give you the idea that I’m just coming up with these kinds of suggestions to try to foil your approaches to figuring out the algorithm ‘by cheating’ (…it shouldn’t matter whether or not it was ‘sent on a Saturday’), but the main point is that a complex encryption algorithm is complex, and even if you see it applied multiple times you might not get enough information about how it works from the data suggested to be able to evaluate if you guessed right. In fact, given a combination of a sparse data set (one message, or just a few messages, in plaintext and cipher text) and a complex algorithm involving a very non-obvious mapping function, the odds are strongly against you.

vi. I had the thought that one reason why the inquirer might be confused about some of these things is that s/he might well be aware of the existence of modern cryptographic techniques which do rely to a significant extent on computational complexity aspects. I.e., here you do have settings where you’re asked to provide ‘the right answer’ (‘the password’), but it’s hard to calculate the right answer in a reasonable amount of time unless you have the relevant (private) information at hand – see e.g. these links for more. One way to think about how such a problem relates to the other problem at hand (you have been presented with samples of cipher text and plaintext and you want to guess all the details about how the encryption and decryption schemes which were applied work) is that this kind of algorithm/approach may be applied in combination with other algorithmic approaches to encrypt/decrypt the text you’re analyzing. A really tough prime factorization problem might for all we know be an embedded component of the cryptographic process that is applied to our text. We could call it A.

In such a situation we would definitely be in trouble because stuff like prime factorization is really hard and computationally complex, and to make matters worse just looking at the plaintext and the cipher text would not make it obvious to us that a prime factorization scheme had even been applied to the data. But a really important point is that even if such a tough problem was not present and even if only relatively less computationally demanding problems were involved, we almost certainly still just wouldn’t have enough information to break any semi-decent encryption algorithm based on a small sample of plaintext and cipher text. It might help a little bit, but in the setting contemplated by the inquirer a ‘faster computer’ (/…’more efficient decision algorithm’, etc.) can only help so much.

vii. Shannon and Kerckhoffs may have a point in a general setting, but in specific settings like this particular one I think it is well worth taking into account the implications of not having a (publicly) known algorithm to attack. As wiki notes (see the previous link), ‘Many ciphers are actually based on publicly known algorithms or are open source and so it is only the difficulty of obtaining the key that determines security of the system’. The above remarks were of course all based on an assumption that Eve does not here have the sort of knowledge about the encryption scheme applied that she in many cases today actually might have. There are obvious and well-known weaknesses associated with having security-associated components of a specific cryptographic scheme be independent of the key, but I do not see how it does not in this particular setting cause search space blow-up making the decision problem (did we actually guess right?) intractable in many cases. A key feature of the problem considered by the inquirer is that you here – unlike in many ‘guess the password-settings’ where for example a correct password will allow you access to an application or a document or whatever – do not get any feedback neither in the case where you guess right nor in the case where you guess wrong; it’s a decision problem, not a calculation problem. (However it is perhaps worth noting on the other hand that in a ‘standard guess-the-password-problem’ you may also sometimes implicitly face a similar decision problem due to e.g. the potential for a combination of cryptographic security and steganographic complementary strategies like e.g. these having been applied).

August 14, 2018 Posted by | Computer science, Cryptography, Data, rambling nonsense, Statistics | Leave a comment

Lyapunov Arguments in Optimization

I’d say that if you’re interested in the intersection of mathematical optimization methods/-algorithms and dynamical systems analysis it’s probably a talk well worth watching. The lecture is reasonably high-level and covers a fairly satisfactory amount of ground in a relatively short amount of time, and it is not particularly hard to follow if you have at least some passing familiarity with the fields involved (dynamical systems analysis, statistics, mathematical optimization, computer science/machine learning).

Some links:

Dynamical system.
Euler–Lagrange equation.
Continuous optimization problem.
Gradient descent algorithm.
Lyapunov stability.
Condition number.
Fast (/accelerated-) gradient descent methods.
The Mirror Descent Algorithm.
Cubic regularization of Newton method and its global performance (Nesterov & Polyak).
A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights (Su, Boyd & Candès).
A Variational Perspective on Accelerated Methods in Optimization (Wibisono, Wilson & Jordan).
Breaking Locality Accelerates Block Gauss-Seidel (Tu, Venkataraman, Wilson, Gittens, Jordan & Recht).
A Lyapunov Analysis of Momentum Methods in Optimization (Wilson, Recht & Jordan).
Bregman divergence.
Estimate sequence methods.
Variance reduction techniques.
Stochastic gradient descent.
Langevin dynamics.

 

July 22, 2018 Posted by | Computer science, Lectures, Mathematics, Physics, Statistics | Leave a comment

Big Data (II)

Below I have added a few observation from the last half of the book, as well as some coverage-related links to topics of interest.

“With big data, using correlation creates […] problems. If we consider a massive dataset, algorithms can be written that, when applied, return a large number of spurious correlations that are totally independent of the views, opinions, or hypotheses of any human being. Problems arise with false correlations — for example, divorce rate and margarine consumption […]. [W]hen the number of variables becomes large, the number of spurious correlations also increases. This is one of the main problems associated with trying to extract useful information from big data, because in doing so, as with mining big data, we are usually looking for patterns and correlations. […] one of the reasons Google Flu Trends failed in its predictions was because of these problems. […] The Google Flu Trends project hinged on the known result that there is a high correlation between the number of flu-related online searches and visits to the doctor’s surgery. If a lot of people in a particular area are searching for flu-related information online, it might then be possible to predict the spread of flu cases to adjoining areas. Since the interest is in finding trends, the data can be anonymized and hence no consent from individuals is required. Using their five-year accumulation of data, which they limited to the same time-frame as the CDC data, and so collected only during the flu season, Google counted the weekly occurrence of each of the fifty million most common search queries covering all subjects. These search query counts were then compared with the CDC flu data, and those with the highest correlation were used in the flu trends model. […] The historical data provided a baseline from which to assess current flu activity on the chosen search terms and by comparing the new real-time data against this, a classification on a scale from 1 to 5, where 5 signified the most severe, was established. Used in the 2011–12 and 2012–13 US flu seasons, Google’s big data algorithm famously failed to deliver. After the flu season ended, its predictions were checked against the CDC’s actual data. […] the Google Flu Trends algorithm over-predicted the number of flu cases by at least 50 per cent during the years it was used.” [For more details on why blind/mindless hypothesis testing/p-value hunting on big data sets is usually a terrible idea, see e.g. Burnham & Anderson, US]

“The data Google used [in the Google Flu Trends algorithm], collected selectively from search engine queries, produced results [with] obvious bias […] for example by eliminating everyone who does not use a computer and everyone using other search engines. Another issue that may have led to poor results was that customers searching Google on ‘flu symptoms’ would probably have explored a number of flu-related websites, resulting in their being counted several times and thus inflating the numbers. In addition, search behaviour changes over time, especially during an epidemic, and this should be taken into account by updating the model regularly. Once errors in prediction start to occur, they tend to cascade, which is what happened with the Google Flu Trends predictions: one week’s errors were passed along to the next week. […] [Similarly,] the Ebola prediction figures published by WHO [during the West African Ebola virus epidemic] were over 50 per cent higher than the cases actually recorded. The problems with both the Google Flu Trends and Ebola analyses were similar in that the prediction algorithms used were based only on initial data and did not take into account changing conditions. Essentially, each of these models assumed that the number of cases would continue to grow at the same rate in the future as they had before the medical intervention began. Clearly, medical and public health measures could be expected to have positive effects and these had not been integrated into the model.”

“Every time a patient visits a doctor’s office or hospital, electronic data is routinely collected. Electronic health records constitute legal documentation of a patient’s healthcare contacts: details such as patient history, medications prescribed, and test results are recorded. Electronic health records may also include sensor data such as Magnetic Resonance Imaging (MRI) scans. The data may be anonymized and pooled for research purposes. It is estimated that in 2015, an average hospital in the USA will store over 600 Tb of data, most of which is unstructured. […] Typically, the human genome contains about 20,000 genes and mapping such a genome requires about 100 Gb of data. […] The interdisciplinary field of bioinformatics has flourished as a consequence of the need to manage and analyze the big data generated by genomics. […] Cloud-based systems give authorized users access to data anywhere in the world. To take just one example, the NHS plans to make patient records available via smartphone by 2018. These developments will inevitably generate more attacks on the data they employ, and considerable effort will need to be expended in the development of effective security methods to ensure the safety of that data. […] There is no absolute certainty on the Web. Since e-documents can be modified and updated without the author’s knowledge, they can easily be manipulated. This situation could be extremely damaging in many different situations, such as the possibility of someone tampering with electronic medical records. […] [S]ome of the problems facing big data systems [include] ensuring they actually work as intended, [that they] can be fixed when they break down, and [that they] are tamper-proof and accessible only to those with the correct authorization.”

“With transactions being made through sales and auction bids, eBay generates approximately 50 Tb of data a day, collected from every search, sale, and bid made on their website by a claimed 160 million active users in 190 countries. […] Amazon collects vast amounts of data including addresses, payment information, and details of everything an individual has ever looked at or bought from them. Amazon uses its data in order to encourage the customer to spend more money with them by trying to do as much of the customer’s market research as possible. In the case of books, for example, Amazon needs to provide not only a huge selection but to focus recommendations on the individual customer. […] Many customers use smartphones with GPS capability, allowing Amazon to collect data showing time and location. This substantial amount of data is used to construct customer profiles allowing similar individuals and their recommendations to be matched. Since 2013, Amazon has been selling customer metadata to advertisers in order to promote their Web services operation […] Netflix collects and uses huge amounts of data to improve customer service, such as offering recommendations to individual customers while endeavouring to provide reliable streaming of its movies. Recommendation is at the heart of the Netflix business model and most of its business is driven by the data-based recommendations it is able to offer customers. Netflix now tracks what you watch, what you browse, what you search for, and the day and time you do all these things. It also records whether you are using an iPad, TV, or something else. […] As well as collecting search data and star ratings, Netflix can now keep records on how often users pause or fast forward, and whether or not they finish watching each programme they start. They also monitor how, when, and where they watched the programme, and a host of other variables too numerous to mention.”

“Data science is becoming a popular study option in universities but graduates so far have been unable to meet the demands of commerce and industry, where positions in data science offer high salaries to experienced applicants. Big data for commercial enterprises is concerned with profit, and disillusionment will set in quickly if an over-burdened data analyst with insufficient experience fails to deliver the expected positive results. All too often, firms are asking for a one-size-fits-all model of data scientist who is expected to be competent in everything from statistical analysis to data storage and data security.”

“In December 2016, Yahoo! announced that a data breach involving over one billion user accounts had occurred in August 2013. Dubbed the biggest ever cyber theft of personal data, or at least the biggest ever divulged by any company, thieves apparently used forged cookies, which allowed them access to accounts without the need for passwords. This followed the disclosure of an attack on Yahoo! in 2014, when 500 million accounts were compromised. […] The list of big data security breaches increases almost daily. Data theft, data ransom, and data sabotage are major concerns in a data-centric world. There have been many scares regarding the security and ownership of personal digital data. Before the digital age we used to keep photos in albums and negatives were our backup. After that, we stored our photos electronically on a hard-drive in our computer. This could possibly fail and we were wise to have back-ups but at least the files were not publicly accessible. Many of us now store data in the Cloud. […] If you store all your photos in the Cloud, it’s highly unlikely with today’s sophisticated systems that you would lose them. On the other hand, if you want to delete something, maybe a photo or video, it becomes difficult to ensure all copies have been deleted. Essentially you have to rely on your provider to do this. Another important issue is controlling who has access to the photos and other data you have uploaded to the Cloud. […] although the Internet and Cloud-based computing are generally thought of as wireless, they are anything but; data is transmitted through fibre-optic cables laid under the oceans. Nearly all digital communication between continents is transmitted in this way. My email will be sent via transatlantic fibre-optic cables, even if I am using a Cloud computing service. The Cloud, an attractive buzz word, conjures up images of satellites sending data across the world, but in reality Cloud services are firmly rooted in a distributed network of data centres providing Internet access, largely through cables. Fibre-optic cables provide the fastest means of data transmission and so are generally preferable to satellites.”

Links:

Health care informatics.
Electronic health records.
European influenza surveillance network.
Overfitting.
Public Health Emergency of International Concern.
Virtual Physiological Human project.
Watson (computer).
Natural language processing.
Anthem medical data breach.
Electronic delay storage automatic calculator (EDSAC). LEO (computer). ICL (International Computers Limited).
E-commerce. Online shopping.
Pay-per-click advertising model. Google AdWords. Click fraud. Targeted advertising.
Recommender system. Collaborative filtering.
Anticipatory shipping.
BlackPOS Malware.
Data Encryption Standard algorithm. EFF DES cracker.
Advanced Encryption Standard.
Tempora. PRISM (surveillance program). Edward Snowden. WikiLeaks. Tor (anonymity network). Silk Road (marketplace). Deep web. Internet of Things.
Songdo International Business District. Smart City.
United Nations Global Pulse.

July 19, 2018 Posted by | Books, Computer science, Cryptography, Data, Engineering, Epidemiology, Statistics | Leave a comment

Big Data (I?)

Below a few observations from the first half of the book, as well as some links related to the topic coverage.

“The data we derive from the Web can be classified as structured, unstructured, or semi-structured. […] Carefully structured and tabulated data is relatively easy to manage and is amenable to statistical analysis, indeed until recently statistical analysis methods could be applied only to structured data. In contrast, unstructured data is not so easily categorized, and includes photos, videos, tweets, and word-processing documents. Once the use of the World Wide Web became widespread, it transpired that many such potential sources of information remained inaccessible because they lacked the structure needed for existing analytical techniques to be applied. However, by identifying key features, data that appears at first sight to be unstructured may not be completely without structure. Emails, for example, contain structured metadata in the heading as well as the actual unstructured message […] and so may be classified as semi-structured data. Metadata tags, which are essentially descriptive references, can be used to add some structure to unstructured data. […] Dealing with unstructured data is challenging: since it cannot be stored in traditional databases or spreadsheets, special tools have had to be developed to extract useful information. […] Approximately 80 per cent of the world’s data is unstructured in the form of text, photos, and images, and so is not amenable to the traditional methods of structured data analysis. ‘Big data’ is now used to refer not just to the total amount of data generated and stored electronically, but also to specific datasets that are large in both size and complexity, with which new algorithmic techniques are required in order to extract useful information from them.”

“In the digital age we are no longer entirely dependent on samples, since we can often collect all the data we need on entire populations. But the size of these increasingly large sets of data cannot alone provide a definition for the term ‘big data’ — we must include complexity in any definition. Instead of carefully constructed samples of ‘small data’ we are now dealing with huge amounts of data that has not been collected with any specific questions in mind and is often unstructured. In order to characterize the key features that make data big and move towards a definition of the term, Doug Laney, writing in 2001, proposed using the three ‘v’s: volume, variety, and velocity. […] ‘Volume’ refers to the amount of electronic data that is now collected and stored, which is growing at an ever-increasing rate. Big data is big, but how big? […] Generally, we can say the volume criterion is met if the dataset is such that we cannot collect, store, and analyse it using traditional computing and statistical methods. […] Although a great variety of data [exists], ultimately it can all be classified as structured, unstructured, or semi-structured. […] Velocity is necessarily connected with volume: the faster the data is generated, the more there is. […] Velocity also refers to the speed at which data is electronically processed. For example, sensor data, such as that generated by an autonomous car, is necessarily generated in real time. If the car is to work reliably, the data […] must be analysed very quickly […] Variability may be considered as an additional dimension of the velocity concept, referring to the changing rates in flow of data […] computer systems are more prone to failure [during peak flow periods]. […] As well as the original three ‘v’s suggested by Laney, we may add ‘veracity’ as a fourth. Veracity refers to the quality of the data being collected. […] Taken together, the four main characteristics of big data – volume, variety, velocity, and veracity – present a considerable challenge in data management.” [As regular readers of this blog might be aware, not everybody would agree with the author here about the inclusion of veracity as a defining feature of big data – “Many have suggested that there are more V’s that are important to the big data problem [than volume, variety & velocity] such as veracity and value (IEEE BigData 2013). Veracity refers to the trustworthiness of the data, and value refers to the value that the data adds to creating knowledge about a topic or situation. While we agree that these are important data characteristics, we do not see these as key features that distinguish big data from regular data. It is important to evaluate the veracity and value of all data, both big and small. (Knoth & Schmid)]

“Anyone who uses a personal computer, laptop, or smartphone accesses data stored in a database. Structured data, such as bank statements and electronic address books, are stored in a relational database. In order to manage all this structured data, a relational database management system (RDBMS) is used to create, maintain, access, and manipulate the data. […] Once […] the database [has been] constructed we can populate it with data and interrogate it using structured query language (SQL). […] An important aspect of relational database design involves a process called normalization which includes reducing data duplication to a minimum and hence reduces storage requirements. This allows speedier queries, but even so as the volume of data increases the performance of these traditional databases decreases. The problem is one of scalability. Since relational databases are essentially designed to run on just one server, as more and more data is added they become slow and unreliable. The only way to achieve scalability is to add more computing power, which has its limits. This is known as vertical scalability. So although structured data is usually stored and managed in an RDBMS, when the data is big, say in terabytes or petabytes and beyond, the RDBMS no longer works efficiently, even for structured data. An important feature of relational databases and a good reason for continuing to use them is that they conform to the following group of properties: atomicity, consistency, isolation, and durability, usually known as ACID. Atomicity ensures that incomplete transactions cannot update the database; consistency excludes invalid data; isolation ensures one transaction does not interfere with another transaction; and durability means that the database must update before the next transaction is carried out. All these are desirable properties but storing and accessing big data, which is mostly unstructured, requires a different approach. […] given the current data explosion there has been intensive research into new storage and management techniques. In order to store these massive datasets, data is distributed across servers. As the number of servers involved increases, the chance of failure at some point also increases, so it is important to have multiple, reliably identical copies of the same data, each stored on a different server. Indeed, with the massive amounts of data now being processed, systems failure is taken as inevitable and so ways of coping with this are built into the methods of storage.”

“A distributed file system (DFS) provides effective and reliable storage for big data across many computers. […] Hadoop DFS [is] one of the most popular DFS […] When we use Hadoop DFS, the data is distributed across many nodes, often tens of thousands of them, physically situated in data centres around the world. […] The NameNode deals with all requests coming in from a client computer; it distributes storage space, and keeps track of storage availability and data location. It also manages all the basic file operations (e.g. opening and closing files) and controls data access by client computers. The DataNodes are responsible for actually storing the data and in order to do so, create, delete, and replicate blocks as necessary. Data replication is an essential feature of the Hadoop DFS. […] It is important that several copies of each block are stored so that if a DataNode fails, other nodes are able to take over and continue with processing tasks without loss of data. […] Data is written to a DataNode only once but will be read by an application many times. […] One of the functions of the NameNode is to determine the best DataNode to use given the current usage, ensuring fast data access and processing. The client computer then accesses the data block from the chosen node. DataNodes are added as and when required by the increased storage requirements, a feature known as horizontal scalability. One of the main advantages of Hadoop DFS over a relational database is that you can collect vast amounts of data, keep adding to it, and, at that time, not yet have any clear idea of what you want to use it for. […] structured data with identifiable rows and columns can be easily stored in a RDBMS while unstructured data can be stored cheaply and readily using a DFS.”

NoSQL is the generic name used to refer to non-relational databases and stands for Not only SQL. […] The non-relational model has some features that are necessary in the management of big data, namely scalability, availability, and performance. With a relational database you cannot keep scaling vertically without loss of function, whereas with NoSQL you scale horizontally and this enables performance to be maintained. […] Within the context of a distributed database system, consistency refers to the requirement that all copies of data should be the same across nodes. […] Availability requires that if a node fails, other nodes still function […] Data, and hence DataNodes, are distributed across physically separate servers and communication between these machines will sometimes fail. When this occurs it is called a network partition. Partition tolerance requires that the system continues to operate even if this happens. In essence, what the CAP [Consistency, Availability, Partition Tolerance] Theorem states is that for any distributed computer system, where the data is shared, only two of these three criteria can be met. There are therefore three possibilities; the system must be: consistent and available, consistent and partition tolerant, or partition tolerant and available. Notice that since in a RDMS the network is not partitioned, only consistency and availability would be of concern and the RDMS model meets both of these criteria. In NoSQL, since we necessarily have partitioning, we have to choose between consistency and availability. By sacrificing availability, we are able to wait until consistency is achieved. If we choose instead to sacrifice consistency it follows that sometimes the data will differ from server to server. The somewhat contrived acronym BASE (Basically Available, Soft, and Eventually consistent) is used as a convenient way of describing this situation. BASE appears to have been chosen in contrast to the ACID properties of relational databases. ‘Soft’ in this context refers to the flexibility in the consistency requirement. The aim is not to abandon any one of these criteria but to find a way of optimizing all three, essentially a compromise. […] The name NoSQL derives from the fact that SQL cannot be used to query these databases. […] There are four main types of non-relational or NoSQL database: key-value, column-based, document, and graph – all useful for storing large amounts of structured and semi-structured data. […] Currently, an approach called NewSQL is finding a niche. […] the aim of this latent technology is to solve the scalability problems associated with the relational model, making it more useable for big data.”

“A popular way of dealing with big data is to divide it up into small chunks and then process each of these individually, which is basically what MapReduce does by spreading the required calculations or queries over many, many computers. […] Bloom filters are particularly suited to applications where storage is an issue and where the data can be thought of as a list. The basic idea behind Bloom filters is that we want to build a system, based on a list of data elements, to answer the question ‘Is X in the list?’ With big datasets, searching through the entire set may be too slow to be useful, so we use a Bloom filter which, being a probabilistic method, is not 100 per cent accurate—the algorithm may decide that an element belongs to the list when actually it does not; but it is a fast, reliable, and storage efficient method of extracting useful knowledge from data. Bloom filters have many applications. For example, they can be used to check whether a particular Web address leads to a malicious website. In this case, the Bloom filter would act as a blacklist of known malicious URLs against which it is possible to check, quickly and accurately, whether it is likely that the one you have just clicked on is safe or not. Web addresses newly found to be malicious can be added to the blacklist. […] A related example is that of malicious email messages, which may be spam or may contain phishing attempts. A Bloom filter provides us with a quick way of checking each email address and hence we would be able to issue a timely warning if appropriate. […] they can [also] provide a very useful way of detecting fraudulent credit card transactions.”

Links:

Data.
Punched card.
Clickstream log.
HTTP cookie.
Australian Square Kilometre Array Pathfinder.
The Millionaire Calculator.
Data mining.
Supervised machine learning.
Unsupervised machine learning.
Statistical classification.
Cluster analysis.
Moore’s Law.
Cloud storage. Cloud computing.
Data compression. Lossless data compression. Lossy data compression.
ASCII. Huffman algorithm. Variable-length encoding.
Data compression ratio.
Grayscale.
Discrete cosine transform.
JPEG.
Bit array. Hash function.
PageRank algorithm.
Common crawl.

July 14, 2018 Posted by | Books, Computer science, Data, Statistics | Leave a comment

Frontiers in Statistical Quality Control (I)

The XIth International Workshop on Intelligent Statistical Quality Control took place in Sydney, Australia from August 20 to August 23, 2013. […] The 23 papers in this volume were carefully selected by the scientific program committee, reviewed by its members, revised by the authors and, finally, adapted by the editors for this volume. The focus of the book lies on three major areas of statistical quality control: statistical process control (SPC), acceptance sampling and design of experiments. The majority of the papers deal with statistical process control while acceptance sampling, and design of experiments are treated to a lesser extent.”

I’m currently reading this book. It’s quite technical and a bit longer than many of the other non-fiction books I’ve read this year (…but shorter than others; however it is still ~400 pages of content exclusively devoted to statistical papers), so it may take me a while to finish it. I figured the fact that I may not finish the book in a while was not a good argument against blogging relevant sections of the book now, especially as it’s already been some time since I read the first few chapters.

When reading a book like this one I care a lot more about understanding the concepts than about understanding the proofs, so as usual the amount of math included in the post is limited; please don’t assume it’s because there are no equations in the book.

Below I have added some ideas and observations from the first 100 pages or so of the book’s coverage.

“A growing number of [statistical quality control] applications involve monitoring with rare event data. […] The most common approaches for monitoring such processes involve using an exponential distribution to model the time between the events or using a Bernoulli distribution to model whether or not each opportunity for the event results in its occurrence. The use of a sequence of independent Bernoulli random variables leads to a geometric distribution for the number of non-occurrences between the occurrences of the rare events. One surveillance method is to use a power transformation on the exponential or geometric observations to achieve approximate normality of the in control distribution and then use a standard individuals control chart. We add to the argument that use of this approach is very counterproductive and cover some alternative approaches. We discuss the choice of appropriate performance metrics. […] Most often the focus is on detecting process deterioration, i.e., an increase in the probability of the adverse event or a decrease in the average time between events. Szarka and Woodall (2011) reviewed the extensive number of methods that have been proposed for monitoring processes using Bernoulli data. Generally, it is difficult to better the performance of the Bernoulli cumulative sum (CUSUM) chart of Reynolds and Stoumbos (1999). The Bernoulli and geometric CUSUM charts can be designed to be equivalent […] Levinson (2011) argued that control charts should not be used with healthcare rare event data because in many situations there is an assignable cause for each error, e.g., each hospital-acquired infection or serious prescription error, and each incident should be investigated. We agree that serious adverse events should be investigated whether or not they result in a control chart signal. The investigation of rare adverse events, however, and the implementation of process improvements to prevent future such errors, does not preclude using a control chart to determine if the rate of such events has increased or decreased over time. In fact, a control chart can be used to evaluate the success of any process improvement initiative.”

“The choice of appropriate performance metrics for comparing surveillance schemes for monitoring Bernoulli and exponential data is quite important. The usual Average Run Length (ARL) metric refers to the average number of points plotted on the chart until a signal is given. This metric is most clearly appropriate when the time between the plotted points is constant. […] In some cases, such as in monitoring the number of near-miss accidents, it may be informative to use a metric that reflects the actual time required to obtain an out-of-control signal. Thus one can consider the number of Bernoulli trials until an out-of-control signal is given for Bernoulli data, leading to its average, the ANOS. The ANOS will be proportional to the average time before a signal if the rate at which the Bernoulli trials are observed is constant over time. For exponentially distributed data one could consider the average time to signal, the ATS. If the process is stable, then ANOS = ARL / p and ATS = ARS * θ, where p and θ are the Bernoulli probability and the exponential mean, respectively. […] To assess out-of-control performance we believe it is most realistic to consider steady-state performance where the shift in the parameter occurs at some time after monitoring has begun. […] Under this scenario one cannot easily convert the ARL metric to the ANOS and ATS metrics. Consideration of steady state performance of competing methods is important because some methods have an implicit headstart feature that results in good zero-state performance, but poor steady-state performance.”

“Data aggregation is frequently done when monitoring rare events and for count data generally. For example, one might monitor the number of accidents per month in a plant or the number of patient falls per week in a hospital. […] Schuh et al. (2013) showed […] that there can be significantly long expected delays in detecting process deterioration when data are aggregated over time even when there are few samples with zero events. One can always aggregate data over long enough time periods to avoid zero counts, but the consequence is slower detection of increases in the rate of the adverse event. […] aggregating event data over fixed time intervals, as frequently done in practice, can result in significant delays in detecting increases in the rate of adverse events. […] Another type of aggregation is to wait until one has observed a given number of events before updating a control chart based on a proportion or waiting time. […] This type of aggregation […] does not appear to delay the detection of process changes nearly as much as aggregating data over fixed time periods. […] We believe that the adverse effect of aggregating data over time has not been fully appreciated in practice and more research work is needed on this topic. Only a couple of the most basic scenarios for count data have been studied. […] Virtually all of the work on monitoring the rate of rare events is based on the assumption that there is a sustained shift in the rate. In some applications the rate change may be transient. In this scenario other performance metrics would be needed, such as the probability of detecting the process shift during the transient period. The effect of data aggregation over time might be larger if shifts in the parameter are not sustained.”

Big data is a popular term that is used to describe the large, diverse, complex and/or longitudinal datasets generated from a variety of instruments, sensors and/or computer-based transactions. […] The acquisition of data does not automatically transfer to new knowledge about the system under study. […] To be able to gain knowledge from big data, it is imperative to understand both the scale and scope of big data. The challenges with processing and analyzing big data are not only limited to the size of the data. These challenges include the size, or volume, as well as the variety and velocity of the data (Zikopoulos et al. 2012). Known as the 3V’s, the volume, variety, and/or velocity of the data are the three main characteristics that distinguish big data from the data we have had in the past. […] Many have suggested that there are more V’s that are important to the big data problem such as veracity and value (IEEE BigData 2013). Veracity refers to the trustworthiness of the data, and value refers to the value that the data adds to creating knowledge about a topic or situation. While we agree that these are important data characteristics, we do not see these as key features that distinguish big data from regular data. It is important to evaluate the veracity and value of all data, both big and small. Both veracity and value are related to the concept of data quality, an important research area in the Information Systems (IS) literature for more than 50 years. The research literature discussing the aspects and measures of data quality is extensive in the IS field, but seems to have reached a general agreement that the multiple aspects of data quality can be grouped into several broad categories […]. Two of the categories relevant here are contextual and intrinsic dimensions of data quality. Contextual aspects of data quality are context specific measures that are subjective in nature, including concepts like value-added, believability, and relevance. […] Intrinsic aspects of data quality are more concrete in nature, and include four main dimensions: accuracy, timeliness, consistency, and completeness […] From our perspective, many of the contextual and intrinsic aspects of data quality are related to the veracity and value of the data. That said, big data presents new challenges in conceptualizing, evaluating, and monitoring data quality.”

The application of SPC methods to big data is similar in many ways to the application of SPC methods to regular data. However, many of the challenges inherent to properly studying and framing a problem can be more difficult in the presence of massive amounts of data. […] it is important to note that building the model is not the end-game. The actual use of the analysis in practice is the goal. Thus, some consideration needs to be given to the actual implementation of the statistical surveillance applications. This brings us to another important challenge, that of the complexity of many big data applications. SPC applications have a tradition of back of the napkin methods. The custom within SPC practice is the use of simple methods that are easy to explain like the Shewhart control chart. These are often the best methods to use to gain credibility because they are easy to understand and easy to explain to a non-statistical audience. However, big data often does not lend itself to easy-to-compute or easy-to-explain methods. While a control chart based on a neural net may work well, it may be so difficult to understand and explain that it may be abandoned for inferior, yet simpler methods. Thus, it is important to consider the dissemination and deployment of advanced analytical methods in order for them to be effectively used in practice. […] Another challenge in monitoring high dimensional data sets is the fact that not all of the monitored variables are likely to shift at the same time; thus, some method is necessary to identify the process variables that have changed. In high dimensional data sets, the decomposition methods used with multivariate control charts can become very computationally expensive. Several authors have considered variable selection methods combined with control charts to quickly detect process changes in a variety of practical scenarios including fault detection, multistage processes, and profile monitoring. […] All of these methods based on variable selection techniques are based on the idea of monitoring subsets of potentially faulty variables. […] Some variable reduction methods are needed to better identify shifts. We believe that further work in the areas combining variable selection methods and surveillance are important for quickly and efficiently diagnosing changes in high-dimensional data.

“A multiple stream process (MSP) is a process that generates several streams of output. From the statistical process control standpoint, the quality variable and its specifications are the same in all streams. A classical example is a filling process such as the ones found in beverage, cosmetics, pharmaceutical and chemical industries, where a filler machine may have many heads. […] Although multiple-stream processes are found very frequently in industry, the literature on schemes for the statistical control of such kind of processes is far from abundant. This paper presents a survey of the research on this topic. […] The first specific techniques for the statistical control of MSPs are the group control charts (GCCs) […] Clearly the chief motivation for these charts was to avoid the proliferation of control charts that would arise if every stream were controlled with a separate pair of charts (one for location and other for spread). Assuming the in-control distribution of the quality variable to be the same in all streams (an assumption which is sometimes too restrictive), the control limits should be the same for every stream. So, the basic idea is to build only one chart (or a pair of charts) with the information from all streams.”

“The GCC will work well if the values of the quality variable in the different streams are independent and identically distributed, that is, if there is no cross-correlation between streams. However, such an assumption is often unrealistic. In many real multiple-stream processes, the value of the observed quality variable is typically better described as the sum of two components: a common component (let’s refer to it as “mean level”), exhibiting variation that affects all streams in the same way, and the individual component of each stream, which corresponds to the difference between the stream observation and the common mean level. […] [T]he presence of the mean level component leads to reduced sensitivity of Boyd’s GCC to shifts in the individual component of a stream if the variance […] of the mean level is large with respect to the variance […] of the individual stream components. Moreover, the GCC is a Shewhart-type chart; if the data exhibit autocorrelation, the traditional form of estimating the process standard deviation (for establishing the control limits) based on the average range or average standard deviation of individual samples (even with the Bonferroni or Dunn-Sidak correction) will result in too frequent false alarms, due to the underestimation of the process total variance. […] [I]in the converse situation […] the GCC will have little sensitivity to causes that affect all streams — at least, less sensitivity than would have a chart on the average of the measurements across all streams, since this one would have tighter limits than the GCC. […] Therefore, to monitor MSPs with the two components described, Mortell and Runger (1995) proposed using two control charts: First, a chart for the grand average between streams, to monitor the mean level. […] For monitoring the individual stream components, they proposed using a special range chart (Rt chart), whose statistic is the range between streams, that is, the difference between the largest stream average and the smallest stream average […] the authors commented that both the chart on the average of all streams and the Rt chart can be used even when at each sampling time only a subset of the streams are sampled (provided that the number of streams sampled remains constant). The subset can be varied periodically or even chosen at random. […] it is common in practice to measure only a subset of streams at each sampling time, especially when the number of streams is large. […] Although almost the totality of Mortell and Runger’s paper is about the monitoring of the individual streams, the importance of the chart on the average of all streams for monitoring the mean level of the process cannot be overemphasized.”

“Epprecht and Barros (2013) studied a filling process application where the stream variances were similar, but the stream means differed, wandered, changed from day to day, were very difficult to adjust, and the production runs were too short to enable good estimation of the parameters of the individual streams. The solution adopted to control the process was to adjust the target above the nominal level to compensate for the variation between streams, as a function of the lower specification limit, of the desired false-alarm rate and of a point (shift, power) arbitrarily selected. This would be a MSP version of “acceptance control charts” (Montgomery 2012, Sect. 10.2) if taking samples with more than one observation per stream [is] feasible.”

Most research works consider a small to moderate number of streams. Some processes may have hundreds of streams, and in this case the issue of how to control the false-alarm rate while keeping enough detection power […] becomes a real problem. […] Real multiple-stream processes can be very ill-behaved. The author of this paper has seen a plant with six 20-stream filling processes in which the stream levels had different means and variances and could not be adjusted separately (one single pump and 20 hoses). For many real cases with particular twists like this one, it happens that no previous solution in the literature is applicable. […] The appropriateness and efficiency of [different monitoring methods] depends on the dynamic behaviour of the process over time, on the degree of cross-correlation between streams, on the ratio between the variabilities of the individual streams and of the common component (note that these three factors are interrelated), on the type and size of shifts that are likely and/or relevant to detect, on the ease or difficulty to adjust all streams in the same target, on the process capability, on the number of streams, on the feasibility of taking samples of more than one observation per stream at each sampling time (or even the feasibility of taking one observation of every stream at each sampling time!), on the length of the production runs, and so on. So, the first problem in a practical application is to characterize the process and select the appropriate monitoring scheme (or to adapt one, or to develop a new one). This analysis may not be trivial for the average practitioner in industry. […] Jirasettapong and Rojanarowan (2011) is the only work I have found on the issue of selecting the most suitable monitoring scheme for an MSP. It considers only a limited number of alternative schemes and a few aspects of the problem. More comprehensive analyses are needed.”

June 27, 2018 Posted by | Books, Data, Engineering, Statistics | Leave a comment

Alcohol and Aging (II)

I gave the book 3 stars on goodreads.

As is usual for publications of this nature, the book includes many chapters that cover similar topics and so the coverage can get a bit repetitive if you’re reading it from cover to cover the way I did; most of the various chapter authors obviously didn’t read the other contributions included in the book, and as each chapter is meant to stand on its own you end up with a lot of chapter introductions which cover very similar topics. If you can disregard such aspects it’s a decent book, which covers a wide variety of topics.

Below I have added some observations from some of the chapters of the book which I did not cover in my first post.

It is widely accepted that consuming heavy amounts of alcohol and binge drinking are detrimental to the brain. Animal studies that have examined the anatomical changes that occur to the brain as a consequence of consuming alcohol indicate that heavy alcohol consumption and binge drinking leads to the death of existing neurons [10, 11] and prevents production of new neurons [12, 13]. […] While animal studies indicate that consuming even moderate amounts of alcohol is detrimental to the brain, the evidence from epidemiological studies is less clear. […] Epidemiological studies that have examined the relationship between late life alcohol consumption and cognition have frequently reported that older adults who consume light to moderate amounts of alcohol are less likely to develop dementia and have higher cognitive functioning compared to older adults who do not consume alcohol. […] In a meta-analysis of 15 prospective cohort studies, consuming light to moderate amounts of alcohol was associated with significantly lower relative risk (RR) for Alzheimer’s disease (RR=0.72, 95% CI=0.61–0.86), vascular dementia (RR=0.75, 95% CI=0.57–0.98), and any type of dementia (RR=0.74, 95% CI=0.61–0.91), but not cognitive decline (RR=0.28, 95 % CI=0.03–2.83) [31]. These findings are consistent with a previous meta-analysis by Peters et al. [33] in which light to moderate alcohol consumption was associated with a decreased risk for dementia (RR=0.63, 95 % CI=0.53–0.75) and Alzheimer’s disease (RR=0.57, 95 % CI=0.44–0.74), but not vascular dementia (RR=0.82, 95% CI=0.50–1.35) or cognitive decline RR=0.89, 95% CI=0.67–1.17). […] Mild cognitive impairment (MCI) has been used to describe the prodromal stage of Alzheimer’s disease […]. There is no strong evidence to suggest that consuming alcohol is protective against MCI [39, 40] and several studies have reported non-significant findings [41–43].”

The majority of research on the relationship between alcohol consumption and cognitive outcomes has focused on the amount of alcohol consumed during old age, but there is a growing body of research that has examined the relationship between alcohol consumption during middle age and cognitive outcomes several years or decades later. The evidence from this area of research is mixed with some studies not detecting a significant relationship [17, 58, 59], while others have reported that light to moderate alcohol consumption is associated with preserved cognition [60] and decreased risk for cognitive impairment [31, 61, 62]. […] Several epidemiological studies have reported that light to moderate alcohol consumption is associated with a decreased risk for stroke, diabetes, and heart disease [36, 84, 85]. Similar to the U-shaped relationship between alcohol consumption and dementia, heavy alcohol consumption has been associated with poor health [86, 87]. The decreased risk for several metabolic and vascular health conditions for alcohol consumers has been attributed to antioxidants [54], greater concentrations of high-density lipoprotein cholesterol in the bloodstream [88], and reduced blood clot formation [89]. Stroke, diabetes, heart disease, and related conditions have all been associated with lower cognitive functioning during old age [90, 91]. The reduced prevalence of metabolic and vascular health conditions among light to moderate alcohol consumers may contribute to the decreased risk for dementia and cognitive decline for older adults who consume alcohol. A limitation of the hypothesis that the reduced risk for dementia among light and moderate alcohol consumers is conferred through the reduced prevalence of adverse health conditions associated with dementia is the possibility that this relationship is confounded by reverse causality. Alcohol consumption decreases with advancing age and adults may reduce their alcohol consumption in response to the onset of adverse health conditions […] the higher prevalence of dementia and lower cognitive functioning among abstainers may be due in part to their worse health rather than their alcohol consumption.”

A limitation of large cohort studies is that subjects who choose not to participate or are unable to participate are often less healthy than those who do participate. Non-response bias becomes more pronounced with age because only subjects who have survived to old age and are healthy enough to participate are observed. Studies on alcohol consumption and cognition are sensitive to non-response bias because light and moderate drinkers who are not healthy enough to participate in the study will not be observed. Adults who survive to old age despite consuming very high amounts of alcohol represent an even more select segment of the general population because they may have genetic, behavioral, health, social, or other factors that protect them against the negative effects of heavy alcohol consumption. As a result, the analytic sample of epidemiological studies is more likely to be comprised of “healthy” drinkers, which biases results in favor of finding a positive effect of light to moderate alcohol consumption for cognition and health in general. […] The incidence of Alzheimer’s disease doubles every 5 years after 65 years of age [94] and nearly 40% of older adults aged 85 and over are diagnosed with Alzheimer’s disease [7]. The relatively old age of onset for most dementia cases means the observed protective effect of light to moderate alcohol consumption for dementia may be due to alcohol consumers being more likely to die or drop out of a study as a result of their alcohol consumption before they develop dementia. This bias may be especially strong for heavy alcohol consumers. Not properly accounting for death as a competing outcome has been observed to artificially increase the risk of dementia among older adults with diabetes [95] and the effect that death and other competing outcomes may have on the relationship between alcohol consumption and dementia risk is unclear. […] The majority of epidemiological studies that have studied the relationship between alcohol consumption and cognition treat abstainers as the reference category. This can be problematic because often times the abstainer or non-drinking category includes older adults who stopped consuming alcohol because of poor health […] Not differentiating former alcohol consumers from lifelong abstainers has been found to explain some but not all of the benefit of alcohol consumption for preventing mortality from cardiovascular causes [96].”

“It is common for people to engage in other behaviors while consuming alcohol. This complicates the relationship between alcohol consumption and cognition because many of the behaviors associated with alcohol consumption are positively and negatively associated with cognitive functioning. For example, alcohol consumers are more likely to smoke than non-drinkers [104] and smoking has been associated with an increased risk for dementia and cognitive decline [105]. […] The relationship between alcohol consumption and cognition may also differ between people with or without a history of mental illness. Depression reduces the volume of the hippocampus [106] and there is growing evidence that depression plays an important role in dementia. Depression during middle age is recognized as a risk factor for dementia [107], and high depressive symptoms during old age may be an early symptom of dementia [108]. Middle aged adults with depression or other mental illness who self-medicate with alcohol may be at especially high risk for dementia later in life because of synergistic effects that alcohol and depression has on the brain. […] While current evidence from epidemiological studies indicates that consuming light to moderate amounts of alcohol, in particular wine, does not negatively affect cognition and in many cases is associated with cognitive health, adults who do not consume alcohol should not be encouraged to increase their alcohol consumption until further research clarifies these relationships. Inconsistencies between studies on how alcohol consumption categories are defined make it difficult to determine the “optimal” amount of alcohol consumption to prevent dementia. It is likely that the optimal amount of alcohol varies according to a person’s gender, as well as genetic, physiological, behavioral, and health characteristics, making the issue extremely complex.”

Falls are the leading cause of both fatal and nonfatal injuries among older adults, with one in three older adults falling each year, and 20–30% of people who fall suffer moderate to severe injuries such as lacerations, hip fractures, and head traumas. In fact, falls are the foremost cause of both fractures and traumatic brain injury (TBI) among older adults […] In 2013, 2.5 million nonfatal falls among older adults were treated in ED and more than 734,000 of these patients were hospitalized. […] Our analysis of the 2012 Nationwide Emergency Department Sample (NEDS) data set show that fall-related injury was a presenting problem among 12% of all ED visits by those aged 65+, with significant differences among age groups: 9% among the 65–74 age group, 12 % among the 75–84 age group, and 18 % among the 85+ age group [4]. […] heavy alcohol use predicts fractures. For example, among those 55+ years old in a health survey in England, men who consumed more than 8 units of alcohol and women who consumed more than 6 units on their heaviest drinking day in the past week had significantly increased odds of fractures (OR =1.65, 95% CI =1.37–1.98 for men and OR=2.07, 95% CI =1.28–3.35 for women) [63]. […] The 2008–2009 Canadian Community Health Survey-Healthy Aging also showed that consumption of at least one alcoholic drink per week increased the odds of falling by 40 % among those 65+ years [57].”

I at first was not much impressed by the effect sizes mentioned above because there are surely 100 relevant variables they didn’t account for/couldn’t account for, but then I thought a bit more about it. An important observation here – they don’t mention it in the coverage, but it sprang to mind – is that if sick or frail elderly people consume less alcohol than their more healthy counterparts, and are more likely to not consume alcohol (which they do, and which they are, we know this), and if frail or sick(er) elderly people are more likely to suffer a fall/fracture than are people who are relatively healthy (they are, again, we know this), well, then you’d expect consumption of alcohol to be found to have a ‘protective effect’ simply due to confounding by (reverse) indication (unless the researchers were really careful about adjusting for such things, but no such adjustments are mentioned in the coverage, which makes sense as these are just raw numbers being reported). The point is that the null here should not be that ‘these groups should be expected to have the same fall rate/fracture rate’, but rather ‘people who drink alcohol should be expected to be doing better, all else equal’ – but they aren’t, quite the reverse. So ‘the true effect size’ here may be larger than what you’d think.

I’m reasonably sure things are a lot more complicated than the above makes it appear (because of those 100 relevant variables we were talking about…), but I find it interesting anyway. Two more things to note: 1. Have another look at the numbers above if they didn’t sink in the first time. This is more than 10% of emergency department visits for that age group. Falls are a really big deal. 2. Fractures in the elderly are also a potentially really big deal. Here’s a sample quote: “One-fifth of hip fracture victims will die within 6 months of the injury, and only 50% will return to their previous level of independence.” (link). In some contexts, a fall is worse news than a cancer diagnosis, and they are very common events in the elderly. This also means that even relatively small effect sizes here can translate into quite large public health effects, because baseline incidence is so high.

The older adult population is a disproportionate consumer of prescription and over-the-counter medications. In a nationally representative sample of community-dwelling adults aged 57–84 years from the National Social Life, Health, and Aging Project (NSHAP) in 2005–2006, 81 % regularly used at least one prescription medication on a regular basis and 29% used at least five prescription medications. Forty-two percent used at least one nonprescription medication and concurrent use with a prescription medication was common, with 46% of prescription medication users also using OTC medications [2]. Prescription drug use by older adults in the U.S. is also growing. The percentage of older adults taking at least one prescription drug in the last 30 days increased from 73.6% in 1988–1994 to 89.7 % in 2007–2010 and the percentage taking five or more prescription drugs in the last 30 days increased from 13.8% in 1988–1994 to 39.7 % in 2007–2010 [3].”

The aging process can affect the response to a medication by altering its pharmacokinetics and pharmacodynamics [9, 10]. Reduced gastrointestinal motility and gastric acidity can alter the rate or extent of drug absorption. Changes in body composition, including decreased total body water and increased body fat can alter drug distribution. For alcohol, changes in body composition result in higher blood alcohol levels in older adults compared to younger adults after the same dose or quantity  of alcohol consumed. Decreased size of the liver, hepatic blood flow, and function of Phase I (oxidation, reduction, and hydrolysis) metabolic pathways result in reduced drug metabolism and increased drug exposure for drugs that undergo Phase I metabolism. Phase II hepatic metabolic pathways are generally preserved with aging. Decreased size of the kidney, renal blood flow, and glomerular filtration result in slower elimination of medications and metabolites by the kidney and increased drug exposure for medications that undergo renal elimination. Age-related impairment of homeostatic mechanisms and changes in receptor number and function can result in changes in pharmacodynamics as well. Older adults are generally more sensitive to the effects of medications and alcohol which act on the central nervous system for example. The consequences of these physiologic changes with aging are that older adults often experience increased drug exposure for the same dose (higher drug concentrations over time) and increased sensitivity to medications (greater response at a given drug concentration) than their younger counterparts.”

“Aging-related changes in physiology are not the only sources of variability in pharmacokinetics and pharmacodynamics that must be considered for an individual person. Older adults experience more chronic diseases that may decrease drug metabolism and renal elimination than younger cohorts. Frailty may result in further decline in drug metabolism, including Phase II metabolic pathways in the liver […] Drug interactions must also be considered […] A drug interaction is defined as a clinically meaningful change in the effect of one drug when coadministered with another drug [12]. Many drugs, including alcohol, have the potential for a drug interaction when administered concurrently, but whether a clinically meaningful change in effect occurs for a specific person depends on patient-specifc factors including age. Drug interactions are generally classified as pharmacokinetic interactions, where one drug alters the absorption, distribution, metabolism, or elimination of another drug resulting in increased or decreased drug exposure, or pharmacodynamic interactions, where one drug alters the response to another medication through additive or antagonistic pharmacologic effects [13]. An adverse drug event occurs when a pharmacokinetic or pharmacodynamic interaction or combination of both results in changes in drug exposure or response that lead to negative clinical outcomes. The adverse drug event could be a therapeutic failure if drug exposure is decreased or the pharmacologic response is antagonistic. The adverse drug event could be drug toxicity if the drug exposure is increased or the pharmacologic response is additive or synergistic. The threshold for experiencing an adverse event is often lower in older adults due to physiologic changes with aging and medical comorbidities, increasing their risk of experiencing an adverse drug event when medications are taken concurrently.”

“A large number of potential medication–alcohol interactions have been reported in the literature. Mechanisms of these interactions range from pharmacokinetic interactions affecting either alcohol or medication exposure to pharmacodynamics interactions resulting in exaggerated response. […] Epidemiologic evidence suggests that concurrent use of alcohol and medications among older adults is common. […] In a nationally representative U.S. sample of community-dwelling older adults in the National Social Life, Health and Aging Project (NSHAP) 2005–2006, 41% of participants reported consuming alcohol at least once per week and 20% were at risk for an alcohol–medication interaction because they were using both alcohol and alcohol-interacting medications on a regular basis [17]. […] Among participants in the Pennsylvania Assistance Contract for the Elderly program (aged 65–106 years) taking at least one prescription medication, 77% were taking an alcohol-interacting medication and 19% of the alcohol-interacting medication users reported concurrent use of alcohol [18]. […] Although these studies do not document adverse outcomes associated with alcohol–medication interactions, they do document that the potential exists for many older adults. […] High prevalence of concurrent use of alcohol and alcohol-interacting medications have also been reported in Australian men (43% of sedative or anxiolytic users were daily drinkers) [19], in older adults in Finland (42% of at-risk alcohol users were also taking alcohol-interacting medications) [20], and in older Irish adults (72% of participants were exposed to alcohol-interacting medications and 60% of these reported concurrent alcohol use) [21]. Drinking and medication use patterns in older adults may differ across countries, but alcohol–medication interactions appear to be a worldwide concern. […] Polypharmacy in general, and psychotropic burden specifically, has been associated with an increased risk of experiencing a geriatric syndrome such as falls or delirium, in older adults [26, 27]. Based on its pharmacology, alcohol can be considered as a psychotropic drug, and alcohol use should be assessed as part of the medication regimen evaluation to support efforts to prevent or manage geriatric syndromes. […] Combining alcohol and CNS active medications can be particularly problematic […] Older adults suffering from sleep problems or pain may be a particular risk for alcohol–medication interaction-related adverse events.”

In general, alcohol use in younger couples has been found to be highly concordant, that is, individuals in a relationship tend to engage in similar drinking behaviors [67,68]. Less is known, however, about alcohol use concordance between older couples. Graham and Braun [69] examined similarities in drinking behavior between spouses in a study of 826 community-dwelling older adults in Ontario, Canada. Results showed high concordance of drinking between spouses — whether they drank at all, how much they drank, and how frequently. […] Social learning theory suggests that alcohol use trajectories are strongly influenced by attitudes and behaviors of an individual’s social networks, particularly family and friends. When individuals engage in social activities with family and friends who approve of and engage in drinking, alcohol use, and misuse are reinforced [58, 59]. Evidence shows that among older adults, participation in social activities is correlated with higher levels of alcohol consumption [34, 60]. […] Brennan and Moos [29] […] found that older adults who reported less empathy and support from friends drank more alcohol, were more depressed, and were less self-confident. More stressors involving friends were associated with more drinking problems. Similar to the findings on marital conflict […], conflict in close friendships can prompt alcohol-use problems; conversely, these relationships can suffer as a result of alcohol-related problems. […] As opposed to social network theory […], social selection theory proposes that alcohol consumption changes an individual’s social context [33]. Studies among younger adults have shown that heavier drinkers chose partners and friends who approve of heavier drinking [70] and that excessive drinking can alienate social networks. The Moos study supports the idea that social selection also has a strong influence on drinking behavior among older adults.”

Traditionally, treatment studies in addiction have excluded patients over the age of 65. This bias has left a tremendous gap in knowledge regarding treatment outcomes and an understanding of the neurobiology of addiction in older adults.

Alcohol use causes well-established changes in sleep patterns, such as decreased sleep latency, decreased stage IV sleep, and precipitation or aggravation of sleep apnea [101]. There are also age-associated changes in sleep patterns including increased REM episodes, a decrease in REM length, a decrease in stage III and IV sleep, and increased awakenings. Age-associated changes in sleep can all be worsened by alcohol use and depression. Moeller and colleagues [102] demonstrated in younger subjects that alcohol and depression had additive effects upon sleep disturbances when they occurred together [102]. Wagman and colleagues [101] also have demonstrated that abstinent alcoholics did not sleep well because of insomnia, frequent awakenings, and REM fragmentation [101]; however, when these subjects ingested alcohol, sleep periodicity normalized and REM sleep was temporarily suppressed, suggesting that alcohol use could be used to self-medicate for sleep disturbances. A common anecdote from patients is that alcohol is used to help with sleep problems. […] The use of alcohol to self-medicate is considered maladaptive [34] and is associated with a host of negative outcomes. […] The use of alcohol to aid with sleep has been found to disrupt sleep architecture and cause sleep-related problems and daytime sleepiness [35, 36, 46]. Though alcohol is commonly used to aid with sleep initiation, it can worsen sleep-related breathing disorders and cause snoring and obstructive sleep apnea [36].”

Epidemiologic studies have clearly demonstrated that comorbidity between alcohol use and other psychiatric symptoms is common in younger age groups. Less is known about comorbidity between alcohol use and psychiatric illness in late life [88]. […] Blow et al. [90] reviewed the diagnosis of 3,986 VA patients between ages 60 and 69 presenting for alcohol treatment [90]. The most common comorbid psychiatric disorder was an affective disorder found in 21 % of the patients. […] Blazer et al. [91] studied 997 community dwelling elderly of whom only 4.5% had a history of alcohol use problems [91]; […] of these subjects, almost half had a comorbid diagnosis of depression or dysthymia. Comorbid depressive symptoms are not only common in late life but are also an important factor in the course and prognosis of psychiatric disorders. Depressed alcoholics have been shown to have a more complicated clinical course of depression with an increased risk of suicide and more social dysfunction than non-depressed alcoholics [9296]. […]  Alcohol use prior to late life has also been shown to influence treatment of late life depression. Cook and colleagues [94] found that a prior history of alcohol use problems predicted a more severe and chronic course for depression [94]. […] The effect of past heavy alcohol use is [also] highlighted in the findings from the Liverpool Longitudinal Study demonstrating a fivefold increase in psychiatric illness among elderly men who had a lifetime history of 5 or more years of heavy drinking [24]. The association between heavy alcohol consumption in earlier years and psychiatric morbidity in later life was not explained by current drinking habits. […] While Wernicke-Korsakoff’s syndrome is well described and often caused by alcohol use disorders, alcohol-related dementia may be difficult to differentiate from Alzheimer’s disease. Clinical diagnostic criteria for alcohol-related dementia (ARD) have been proposed and now validated in at least one trial, suggesting a method for distinguishing ARD, including Wernicke-Korsakoff’s syndrome, from other types of dementia [97, 98]. […] Finlayson et al. [100] found that 49 of 216 (23%) elderly patients presenting for alcohol treatment had dementia associated with alcohol use disorders [100].”

 

May 24, 2018 Posted by | Books, Demographics, Epidemiology, Medicine, Neurology, Pharmacology, Psychiatry, Statistics | Leave a comment

Trade-offs when doing medical testing

I was considering whether or not to blog the molecular biology text I recently read today, but I decided against it. However as I did feel like blogging today, I decided instead to add here a few comments I left on SCC. I rarely leave comments on other blogs, but it does happen, and the question I was ‘answering’ (partially – other guys had already added some pretty good comments by the time I joined the debate) is probably a question that I imagine a lot of e.g. undergrads are asking themselves, namely: “What’s the standard procedure, when designing a medical test, to determine the right tradeoff between sensitivity and specificity (where I’m picturing a tradeoff involved in choosing the threshold for a positive test or something similar)?

The ‘short version’, if you want an answer to this question, is probably to read Newman and Kohn’s wonderful book on these- and related- topics (which I blogged here), but that’s not actually a ‘short answer’ in terms of how people usually think about these things. I’ll just reproduce my own comment here, and mention that other guys had already covered some key topics by the time I joined ‘the fray’:

“Some good comments already. I don’t know to which extent the following points have been included in the links provided, but I decided to add them here anyway.

One point worth emphasizing is that you’ll always want a mixture of sensitivity and specificity (or, more broadly, test properties) that’ll mean that your test has clinical relevance. This relates both to the type of test you consider and when/whether to test at all (rather than treat/not treat without testing first). If you’re worried someone has disease X and there’s a high risk of said individual having disease X due to the clinical presentation, some tests will for example be inappropriate even if they are very good at making the distinction between individuals requiring treatment X and individuals not requiring treatment X, for example because they take time to perform that the patient might not have – not an uncommon situation in emergency medicine. If you’re so worried you’d treat him regardless of the test result, you shouldn’t test. And the same goes for e.g. low-sensitivity screens; if a positive test result of a screen does not imply that you’ll actually act on the result of the screen, you shouldn’t perform it (in screening contexts cost effectiveness is usually critically dependent on how you follow up on the test result, and in many contexts inadequate follow-up means that the value of the test goes down a lot […on a related note I have been thinking that I was perhaps not as kind as I could have been when I reviewed Juth & Munthe’s book and I have actually considered whether or not to change my rating of the book; it does give a decent introduction to some key trade-offs with which you’re confronted when you’re dealing with topics related to screening].

Cost effectiveness is another variable that would/should probably (in an ideal world?) enter the analysis when you’re judging what is or is not a good mixture of sensitivity and specificity – you should be willing to pay more for more precise tests, but only to the extent that those more precise tests lead to better outcomes (you’re usually optimizing over patient outcomes, not test accuracy).

Skef also mentions this, but the relative values of specificity and sensitivity may well vary during the diagnostic process; i.e. the (ideal) trade-off will depend on what you plan to use the test for. Is the idea behind testing this guy to make (reasonably?) sure he doesn’t have colon cancer, or to figure out if he needs a more accurate, but also more expensive, test? Screening setups will usually involve a multi-level testing structure, and tests at different levels will not treat these trade-offs the same way, nor should they. This also means that the properties of individual tests can not really be viewed in isolation, which makes the problem of finding ‘the ideal mix’ of test properties (whatever these might be) even harder; if you have three potential tests for example, it’s not enough to compare the tests individually against each other, you’d ideally also want to implicitly take into account that different combinations of tests have different properties, and that the timing of the test may also be an important parameter in the decision problem.”

On a related note I think that in general the idea of looking for some kind of ‘approved method’ that you can use to save yourself from thinking is a very dangerous approach when you’re doing applied statistics. If you’re not thinking about relevant trade-offs and how to deal with them, odds are you’re missing a big part of the picture. If somebody claims to have somehow discovered some simple approach to dealing with all of the relevant trade-offs, well, you should be very skeptical. Statistics usually don’t work like that.

May 4, 2018 Posted by | Medicine, Statistics | Leave a comment

Medical Statistics (III)

In this post I’ll include some links and quotes related to topics covered in chapters 4, 6, and 7 of the book. Before diving in, I’ll however draw attention to some of Gerd Gigerenzer’s work as it is quite relevant to in particular the coverage included in chapter 4 (‘Presenting research findings’), even if the authors seem unaware of this. One of Gigerenzer’s key insights, which I consider important and which I have thus tried to keep in mind, unfortunately goes unmentioned in the book; namely the idea that how you communicate risk might be very important in terms of whether or not people actually understand what you are trying to tell them. A related observation is that people have studied these things and they’ve figured out that some types of risk communication are demonstrably better than others at enabling people to understand the issues at hand and the trade-offs involved in a given situation. I covered some of these ideas in a comment on SCC some time ago; if those comments spark your interest you should definitely go read the book).

IMRAD format.
CONSORT Statement (randomized trials).
Equator Network.

“Abstracts may appear easy to write since they are very short […] and often required to be written in a structured format. It is therefore perhaps surprising that they are sometimes poorly written, too bland, contain inaccuracies, and/or are simply misleading.1  The reason for poor quality abstracts are complex; abstracts are often written at the end of a long process of data collection, analysis, and writing up, when time is short and researchers are weary. […] statistical issues […] can lead to an abstract that is not a fair representation of the research conducted. […] it is important that the abstract is consistent with the body of text and that it gives a balanced summary of the work. […] To maximize its usefulness, a summary or abstract should include estimates and confidence intervals for the main findings and not simply present P values.”

“The methods section should describe how the study was conducted. […] it is important to include the following: *The setting or area […] The date(s) […] subjects included […] study design […] measurements used […] source of any non-original data […] sample size, including a justification […] statistical methods, including any computer software used […] The discussion section is where the findings of the study are discussed and interpreted […] this section tends to include less statistics than the results section […] Some medical journals have a specific structure for the discussion for researchers to follow, and so it is important to check the journal’s guidelines before submitting. […] [When] reporting statistical analyses from statistical programs: *Don’t put unedited computer output into a research document. *Extract the relevant data only and reformat as needed […] Beware of presenting percentages for very small samples as they may be misleading. Simply give the numbers alone. […] In general the following is recommended for P values: *Give the actual P value whenever possible. *Rounding: Two significant figures are usually enough […] [Confidence intervals] should be given whenever possible to indicate the precision of estimates. […] Avoid graphs with missing zeros or stretched scales […] a table or graph should stand alone so that a reader does not need to read the […] article to be able to understand it.”

Statistical data type.
Level of measurement.
Descriptive statistics.
Summary statistics.
Geometric mean.
Harmonic mean.
Mode.
Interquartile range.
Histogram.
Stem and leaf plot.
Box and whisker plot.
Dot plot.

“Quantitative data are data that can be measured numerically and may be continuous or discrete. *Continuous data lie on a continuum and so can take any value between two limits. […] *Discrete data do not lie on a continuum and can only take certain values, usually counts (integers) […] On an interval scale, differences between values at different points of the scale have the same meaning […] Data can be regarded as on a ratio scale if the ratio of the two measurements has a meaning. For example we can say that twice as many people in one group had a particular characteristic compared with another group and this has a sensible meaning. […] Quantitative data are always ordinal – the data values can be arranged in a numerical order from the smallest to the largest. […] *Interval scale data are always ordinal. Ratio scale data are always interval scale data and therefore must also be ordinal. *In practice, continuous data may look discrete because of the way they are measured and/or reported. […] All continuous measurements are limited by the accuracy of the instrument used to measure them, and many quantities such as age and height are reported in whole numbers for convenience”.

“Categorical data are data where individuals fall into a number of separate categories or classes. […] Different categories of categorical data may be assigned a number for coding purposes […] and if there are several categories, there may be an implied ordering, such as with stage of cancer where stage I is the least advanced and stage IV is the most advanced. This means that such data are ordinal but not interval because the ‘distance’ between adjacent categories has no real measurement attached to it. The ‘gap’ between stages I and II disease is not necessarily the same as the ‘gap’ between stages III and IV. […] Where categorical data are coded with numerical codes, it might appear that there is an ordering but this may not necessarily be so. It is important to distinguish between ordered and non-ordered data because it affects the analysis.”

“It is usually useful to present more than one summary measure for a set of data […] If the data are going to be analyzed using methods based on means then it makes sense to present means rather than medians. If the data are skewed they may need to be transformed before analysis and so it is best to present summaries based on the transformed data, such as geometric means. […] For very skewed data rather than reporting the median, it may be helpful to present a different percentile (i.e. not the 50th), which better reflects the shape of the distribution. […] Some researchers are reluctant to present the standard deviation when the data are skewed and so present the median and range and/or quartiles. If analyses are planned which are based on means then it makes sense to be consistent and give standard deviations. Further, the useful relationship that approximately 95% of the data lie between mean +/- 2 standard deviations, holds even for skewed data […] If data are transformed, the standard deviation cannot be back-transformed correctly and so for transformed data a standard deviation cannot be given. In this case the untransformed standard deviation can be given or another measure of spread. […] For discrete data with a narrow range, such as stage of cancer, it may be better to present the actual frequency distribution to give a fair summary of the data, rather than calculate a mean or dichotomize it. […] It is often useful to tabulate one categorical variable against another to show the proportions or percentages of the categories of one variable by the other”.

Random variable.
Independence (probability theory).
Probability.
Probability distribution.
Binomial distribution.
Poisson distribution.
Continuous probability distribution.
Normal distribution.
Uniform distribution.

“The central limit theorem is a very important mathematical theorem that links the Normal distribution with other distributions in a unique and surprising way and is therefore very useful in statistics. *The sum of a large number of independent random variables will follow an approximately Normal distribution irrespective of their underlying distributions. *This means that any random variable which can be regarded as a the sum of a large number of small, independent contributions is likely to follow the Normal distribution. [I didn’t really like this description as it’s insufficiently detailed for my taste (and this was pretty much all they wrote about the CLT in that chapter); and one problem with the CLT is that people often think it applies when it might not actually do so, because the data restrictions implied by the theorem(s) are not really fully appreciated. On a related note people often seem to misunderstand what these theorems actually say and where they apply – see e.g. paragraph 10 in this post. See also the wiki link above for a more comprehensive treatment of these topicsUS] *The Normal distribution can be used as an approximation to the Binomial distribution when n is large […] The Normal distribution can be used as an approximation to the Poisson distribution as the mean of the Poisson distribution increases […] The main advantage in using the Normal rather than the Binomial or the Poisson distribution is that it makes it easier to calculate probabilities and confidence intervals”

“The t distribution plays an important role in statistics as the sampling distribution of the sample mean divided by its standard error and is used in significance testing […] The shape is symmetrical about the mean value, and is similar to the Normal distribution but with a higher peak and longer tails to take account of the reduced precision in smaller samples. The exact shape is determined by the mean and variance plus the degrees of freedom. As the degrees of freedom increase, the shape comes closer to the Normal distribution […] The chi-squared distribution also plays an important role in statistics. If we take several variables, say n, which each follow a standard Normal distribution, and square each and add them, the sum of these will follow a chi-squared distribution with n degrees of freedom. This theoretical result is very useful and widely used in statistical testing […] The chi-squared distribution is always positive and its shape is uniquely determined by the degrees of freedom. The distribution becomes more symmetrical as the degrees of freedom increases. […] [The (noncentral) F distribution] is the distribution of the ratio of two chi-squared distributions and is used in hypothesis testing when we want to compare variances, such as in doing analysis of variance […] Sometimes data may follow a positively skewed distribution which becomes a Normal distribution when each data point is log-transformed [..] In this case the original data can be said to follow a lognormal distribution. The transformation of such data from log-normal to Normal is very useful in allowing skewed data to be analysed using methods based on the Normal distribution since these are usually more powerful than alternative methods”.

Half-Normal distribution.
Bivariate Normal distribution.
Negative binomial distribution.
Beta distribution.
Gamma distribution.
Conditional probability.
Bayes theorem.

April 26, 2018 Posted by | Books, Data, Mathematics, Medicine, Statistics | Leave a comment

Medical Statistics (II)

In this post I’ll include some links and quotes related to topics covered in chapters 2 and 3 of the book. Chapter 2 is about ‘Collecting data’ and chapter 3 is about ‘Handling data: what steps are important?’

“Data collection is a key part of the research process, and the collection method will impact on later statistical analysis of the data. […] Think about the anticipated data analysis [in advance] so that data are collected in the appropriate format, e.g. if a mean will be needed for the analysis, then don’t record the data in categories, record the actual value. […] *It is useful to pilot the data collection process in a range of circumstances to make sure it will work in practice. *This usually involves trialling the data collection form on a smaller sample than intended for the study and enables problems with the data collection form to be identified and resolved prior to main data collection […] In general don’t expect the person filling out the form to do calculations as this may lead to errors, e.g. calculating a length of time between two dates. Instead, record each piece of information to allow computation of the particular value later […] The coding scheme should be designed at the same time as the form so that it can be built into the form. […] It may be important to distinguish between data that are simply missing from the original source and data that the data extractor failed to record. This can be achieved using different codes […] The use of numerical codes for non-numerical data may give the false impression that these data can be treated as if they were numerical data in the statistical analysis. This is not so.”

“It is critical that data quality is monitored and that this happens as the study progresses. It may be too late if problems are only discovered at the analysis stage. If checks are made during the data collection then problems can be corrected. More frequent checks may be worthwhile at the beginning of data collection when processes may be new and staff may be less experienced. […] The layout […] affects questionnaire completion rates and therefore impacts on the overall quality of the data collected.”

“Sometimes researchers need to develop a new measurement or questionnaire scale […] To do this rigorously requires a thorough process. We will outline the main steps here and note the most common statistical measures used in the process. […] Face validity *Is the scale measuring what it sets out to measure? […] Content validity *Does the scale cover all the relevant areas? […] *Between-observers consistency: is there agreement between different observers assessing the same individuals? *Within-observers consistency: is there agreement between assessments on the same individuals by the same observer on two different occasions? *Test-retest consistency: are assessments made on two separate occasions on the same individual similar? […] If a scale has several questions or items which all address the same issue then we usually expect each individual to get similar scores for those questions, i.e. we expect their responses to be internally consistent. […] Cronbach’s alpha […] is often used to assess the degree of internal consistency. [It] is calculated as an average of all correlations among the different questions on the scale. […] *Values are usually expected to be above 0.7 and below 0.9 *Alpha below 0.7 broadly indicates poor internal consistency *Alpha above 0.9 suggests that the items are very similar and perhaps fewer items could be used to obtain the same overall information”.

Bland–Altman plot.
Coefficient of variation.
Intraclass correlation.
Cohen’s kappa.
Likert scale. (“The key characteristic of Likert scales is that the scale is symmetrical. […] Care is needed when analyzing Likert scale data even though a numerical code is assigned to the responses, since the data are ordinal and discrete. Hence an average may be misleading […] It is quite common to collapse Likert scales into two or three categories such as agree versus disagree, but this has the disadvantage that data are discarded.”)
Visual analogue scale. (“VAS scores can be treated like continuous data […] Where it is feasible to use a VAS, it is preferable as it provides greater statistical power than a categorical scale”)

“Correct handling of data is essential to produce valid and reliable statistics. […] Data from research studies need to be coded […] It is important to document the coding scheme for categorical variables such as sex where it will not be obviously [sic, US] what the values mean […] It is strongly recommended that a unique numerical identifier is given to each subject, even if the research is conducted anonymously. […] Computerized datasets are often stored in a spreadsheet format with rows and columns of data. For most statistical analyses it is best to enter the data so that each row represents a different subject and each column a different variable. […] Prefixes or suffixes can be used to denote […] repeated measurements. If there are several repeated variables, use the same ‘scheme’ for all to avoid confusion. […] Try to avoid mixing suffixes and prefixes as it can cause confusion.”

“When data are entered onto a computer at different times it may be necessary to join datasets together. […] It is important to avoid over-writing a current dataset with a new updated version without keeping the old version as a separate file […] the two datasets must use exactly the same variable names for the same variables and the same coding. Any spelling mistakes will prevent a successful joining. […] It is worth checking that the joining has worked as expected by checking that the total number of observations in the updated file is the sum of the two previous files, and that the total number of variables is unchanged. […] When new data are collected on the same individuals at a later stage […], it may [again] be necessary to merge datasets. In order to do this the unique subject identifier must be used to identify the records that must be matched. For the merge to work, all variable names in the two datasets must be different except for the unique identifier. […] Spreadsheets are useful for entering and storing data. However, care should be taken when cutting and pasting different datasets to avoid misalignment of data. […] it is best not to join or sort datasets using a spreadsheet […in some research contexts, I’d add, this is also just plain impossible to even try, due to the amount of data involved – US…] […] It is important to ensure that a unique copy of the current file, the ‘master copy’, is stored at all times. Where the study involves more than one investigator, everyone needs to know who has responsibility for this. It is also important to avoid having two people revising the same file at the same time. […] It is important to keep a record of any changes that are made to the dataset and keep dated copies of datasets as changes are made […] Don’t overwrite datasets with edited versions as older versions may be needed later on.”

“Where possible, it is important to do some [data entry] checks early on to leave time for addressing problems while the study is in progress. […] *Check a random sample of forms for data entry accuracy. If this reveals problems then further checking may be needed. […] If feasible, consider checking data entry forms for key variables, e.g. the primary outcome. […] Range checks: […] tabulate all data to ensure there are no invalid values […] make sure responses are consistent with each other within subjects, e.g. check for any impossible or unlikely combination of responses such as a male with a pregnancy […] Check where feasible that any gaps are true gaps and not missed data entry […] Sometimes finding one error may lead to others being uncovered. For example, if a spreadsheet was used for data entry and one entry was missed, all following entries may be in the wrong columns. Hence, always consider if the discovery of one error may imply that there are others. […] Plots can be useful for checking larger datasets.”

Data monitoring committee.
Damocles guidelines.
Overview of stopping rules for clinical trials.
Pocock boundary.
Haybittle–Peto boundary.

“Trials are only stopped early when it is considered that the evidence for either benefit or harm is overwhelmingly strong. In such cases, the effect size will inevitably be larger than anticipated at the outset of the trial in order to trigger the early stop. Hence effect estimates from trials stopped early tend to be more extreme than would be the case if these trials had continued to the end, and so estimates of the efficacy or harm of a particular treatment may be exaggerated. This phenomenon has been demonstrated in recent reviews.1,2 […] Sometimes it becomes apparent part way through a trial that the assumptions made in the original sample size calculations are not correct. For example, where the primary outcome is a continuous variable, an estimate of the standard deviation (SD) is needed to calculate the required sample size. When the data are summarized during the trial, it may become apparent that the observed SD is different from that expected. This has implications for the statistical power. If the observed SD is smaller than expected then it may be reasonable to reduce the sample size but if it is bigger then it may be necessary to increase it.”

April 16, 2018 Posted by | Books, Medicine, Statistics | Leave a comment

Medical Statistics (I)

I was more than a little critical of the book in my review on goodreads, and the review is sufficiently detailed that I thought it would be worth including it in this post. Here’s what I wrote on goodreads (slightly edited to take full advantage of the better editing options on wordpress):

“The coverage is excessively focused on significance testing. The book also provides very poor coverage of model selection topics, where the authors not once but repeatedly recommend employing statistically invalid approaches to model selection (the authors recommend using hypothesis testing mechanisms to guide model selection, as well as using adjusted R-squared for model selection decisions – both of which are frankly awful ideas, for reasons which are obvious to people familiar with the field of model selection. “Generally, hypothesis testing is a very poor basis for model selection […] There is no statistical theory that supports the notion that hypothesis testing with a fixed α level is a basis for model selection.” “While adjusted R2 is useful as a descriptive statistic, it is not useful in model selection” – quotes taken directly from Burnham & Anderson’s book Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach).

The authors do not at any point in the coverage even mention the option of using statistical information criteria to guide model selection decisions, and frankly repeatedly recommend doing things which are known to be deeply problematic. The authors also cover material from Borenstein and Hedges’ meta-analysis text in the book, yet still somehow manage to give poor advice in the context of meta-analysis along similar lines (implicitly advising people to base model decisions within the context of whether to use fixed effects or random effects on the results of heterogeneity tests, despite this approach being criticized as problematic in the formerly mentioned text).

Basic and not terrible, but there are quite a few problems with this text.”

I’ll add a few more details about the above-mentioned problems before moving on to the main coverage. As for the model selection topic I refer specifically to my coverage of Burnham and Anderson’s book here and here – these guys spent a lot of pages talking about why you shouldn’t do what the authors of this book recommend, and I’m sort of flabbergasted medical statisticians don’t know this kind of stuff by now. To people who’ve read both these books, it’s not really in question who’s in the right here.

I believe part of the reason why I was very annoyed at the authors at times was that they seem to promote exactly a sort of blind unthinking hypothesis-testing approach to things that is unfortunately very common – the entire book is saturated with hypothesis testing stuff, which means that many other topics are woefully insufficiently covered. The meta-analysis example is probably quite illustrative; the authors spend multiple pages on study heterogeneity and how to deal with it, but the entire coverage there is centered around the discussion of a most-likely underpowered test, the result of which should perhaps in the best case scenario direct the researcher’s attention to topics he should be have been thinking carefully about from the very start of his data analysis. You don’t need to quote many words from Borenstein and Hedges (here’s a relevant link) to get to the heart of the matter here:

“It makes sense to use the fixed-effect model if two conditions are met. First, we believe that all the studies included in the analysis are functionally identical. Second, our goal is to compute the common effect size for the identified population, and not to generalize to other populations. […] this situation is relatively rare. […] By contrast, when the researcher is accumulating data from a series of studies that had been performed by researchers operating independently, it would be unlikely that all the studies were functionally equivalent. Typically, the subjects or interventions in these studies would have differed in ways that would have impacted on the results, and therefore we should not assume a common effect size. Therefore, in these cases the random-effects model is more easily justified than the fixed-effect model.

A report should state the computational model used in the analysis and explain why this model was selected. A common mistake is to use the fixed-effect model on the basis that there is no evidence of heterogeneity. As [already] explained […], the decision to use one model or the other should depend on the nature of the studies, and not on the significance of this test [because the test will often have low power anyway].”

Yet these guys spend their efforts here talking about a test that is unlikely to yield useful information and which if anything probably distracts the reader from the main issues at hand; are the studies functionally equivalent? Do we assume there’s one (‘true’) effect size, or many? What do those coefficients we’re calculating actually mean? The authors do in fact include a lot of cautionary notes about how to interpret the test, but in my view all this means is that they’re devoting critical pages to peripheral issues – and perhaps even reinforcing the view that the test is important, or why else would they spend so much effort on it? – rather than promote good thinking about the key topics at hand.

Anyway, enough of the critical comments. Below a few links related to the first chapter of the book, as well as some quotes.

Declaration of Helsinki.
Randomized controlled trial.
Minimization (clinical trials).
Blocking (statistics).
Informed consent.
Blinding (RCTs). (…related xkcd link).
Parallel study. Crossover trial.
Zelen’s design.
Superiority, equivalence, and non-inferiority trials.
Intention-to-treat concept: A review.
Case-control study. Cohort study. Nested case-control study. Cross-sectional study.
Bradford Hill criteria.
Research protocol.
Sampling.
Type 1 and type 2 errors.
Clinical audit. A few quotes on this topic:

“‘Clinical audit’ is a quality improvement process that seeks to improve the patient care and outcomes through systematic review of care against explicit criteria and the implementation of change. Aspects of the structures, processes and outcomes of care are selected and systematically evaluated against explicit criteria. […] The aim of audit is to monitor clinical practice against agreed best practice standards and to remedy problems. […] the choice of topic is guided by indications of areas where improvement is needed […] Possible topics [include] *Areas where a problem has been identified […] *High volume practice […] *High risk practice […] *High cost […] *Areas of clinical practice where guidelines or firm evidence exists […] The organization carrying out the audit should have the ability to make changes based on their findings. […] In general, the same methods of statistical analysis are used for audit as for research […] The main difference between audit and research is in the aim of the study. A clinical research study aims to determine what practice is best, whereas an audit checks to see that best practice is being followed.”

A few more quotes from the end of the chapter:

“In clinical medicine and in medical research it is fairly common to categorize a biological measure into two groups, either to aid diagnosis or to classify an outcome. […] It is often useful to categorize a measurement in this way to guide decision-making, and/or to summarize the data but doing this leads to a loss of information which in turn has statistical consequences. […] If a continuous variable is used for analysis in a research study, a substantially smaller sample size will be needed than if the same variable is categorized into two groups […] *Categorization of a continuous variable into two groups loses much data and should be avoided whenever possible *Categorization of a continuous variable into several groups is less problematic”

“Research studies require certain specific data which must be collected to fulfil the aims of the study, such as the primary and secondary outcomes and main factors related to them. Beyond these data there are often other data that could be collected and it is important to weigh the costs and consequences of not collecting data that will be needed later against the disadvantages of collecting too much data. […] collecting too much data is likely to add to the time and cost to data collection and processing, and may threaten the completeness and/or quality of all of the data so that key data items are threatened. For example if a questionnaire is overly long, respondents may leave some questions out or may refuse to fill it out at all.”

Stratified samples are used when fixed numbers are needed from particular sections or strata of the population in order to achieve balance across certain important factors. For example a study designed to estimate the prevalence of diabetes in different ethnic groups may choose a random sample with equal numbers of subjects in each ethnic group to provide a set of estimates with equal precision for each group. If a simple random sample is used rather than a stratified sample, then estimates for minority ethnic groups may be based on small numbers and have poor precision. […] Cluster samples may be chosen where individuals fall naturally into groups or clusters. For example, patients on a hospital wards or patients in a GP practice. If a sample is needed of these patients, it may be easier to list the clusters and then to choose a random sample of clusters, rather than to choose a random sample of the whole population. […] Cluster sampling is less efficient statistically than simple random sampling […] the ICC summarizes the extent of the ‘clustering effect’. When individuals in the same cluster are much more alike than individuals in different clusters with respect to an outcome, then the clustering effect is greater and the impact on the required sample size is correspondingly greater. In practice there can be a substantial effect on the sample size even when the ICC is quite small. […] As well as considering how representative a sample is, it is important […] to consider the size of the sample. A sample may be unbiased and therefore representative, but too small to give reliable estimates. […] Prevalence estimates from small samples will be imprecise and therefore may be misleading. […] The greater the variability of a measure, the greater the number of subjects needed in the sample to estimate it precisely. […] the power of a study is the ability of the study to detect a difference if one exists.”

April 9, 2018 Posted by | Books, Epidemiology, Medicine, Statistics | Leave a comment

Networks

I actually think this was a really nice book, considering the format – I gave it four stars on goodreads. One of the things I noticed people didn’t like about it in the reviews is that it ‘jumps’ a bit in terms of topic coverage; it covers a wide variety of applications and analytical settings. I mostly don’t consider this a weakness of the book – even if occasionally it does get a bit excessive – and I can definitely understand the authors’ choice of approach; it’s sort of hard to illustrate the potential the analytical techniques described within this book have if you’re not allowed to talk about all the areas in which they have been – or could be gainfully – applied. A related point is that many people who read the book might be familiar with the application of these tools in specific contexts but have perhaps not thought about the fact that similar methods are applied in many other areas (and they might all of them be a bit annoyed the authors don’t talk more about computer science applications, or foodweb analyses, or infectious disease applications, or perhaps sociometry…). Most of the book is about graph-theory-related stuff, but a very decent amount of the coverage deals with applications, in a broad sense of the word at least, not theory. The discussion of theoretical constructs in the book always felt to me driven to a large degree by their usefulness in specific contexts.

I have covered related topics before here on the blog, also quite recently – e.g. there’s at least some overlap between this book and Holland’s book about complexity theory in the same series (I incidentally think these books probably go well together) – and as I found the book slightly difficult to blog as it was I decided against covering it in as much detail as I sometimes do when covering these texts – this means that I decided to leave out the links I usually include in posts like these.

Below some quotes from the book.

“The network approach focuses all the attention on the global structure of the interactions within a system. The detailed properties of each element on its own are simply ignored. Consequently, systems as different as a computer network, an ecosystem, or a social group are all described by the same tool: a graph, that is, a bare architecture of nodes bounded by connections. […] Representing widely different systems with the same tool can only be done by a high level of abstraction. What is lost in the specific description of the details is gained in the form of universality – that is, thinking about very different systems as if they were different realizations of the same theoretical structure. […] This line of reasoning provides many insights. […] The network approach also sheds light on another important feature: the fact that certain systems that grow without external control are still capable of spontaneously developing an internal order. […] Network models are able to describe in a clear and natural way how self-organization arises in many systems. […] In the study of complex, emergent, and self-organized systems (the modern science of complexity), networks are becoming increasingly important as a universal mathematical framework, especially when massive amounts of data are involved. […] networks are crucial instruments to sort out and organize these data, connecting individuals, products, news, etc. to each other. […] While the network approach eliminates many of the individual features of the phenomenon considered, it still maintains some of its specific features. Namely, it does not alter the size of the system — i.e. the number of its elements — or the pattern of interaction — i.e. the specific set of connections between elements. Such a simplified model is nevertheless enough to capture the properties of the system. […] The network approach [lies] somewhere between the description by individual elements and the description by big groups, bridging the two of them. In a certain sense, networks try to explain how a set of isolated elements are transformed, through a pattern of interactions, into groups and communities.”

“[T]he random graph model is very important because it quantifies the properties of a totally random network. Random graphs can be used as a benchmark, or null case, for any real network. This means that a random graph can be used in comparison to a real-world network, to understand how much chance has shaped the latter, and to what extent other criteria have played a role. The simplest recipe for building a random graph is the following. We take all the possible pair of vertices. For each pair, we toss a coin: if the result is heads, we draw a link; otherwise we pass to the next pair, until all the pairs are finished (this means drawing the link with a probability p = ½, but we may use whatever value of p). […] Nowadays [the random graph model] is a benchmark of comparison for all networks, since any deviations from this model suggests the presence of some kind of structure, order, regularity, and non-randomness in many real-world networks.”

“…in networks, topology is more important than metrics. […] In the network representation, the connections between the elements of a system are much more important than their specific positions in space and their relative distances. The focus on topology is one of its biggest strengths of the network approach, useful whenever topology is more relevant than metrics. […] In social networks, the relevance of topology means that social structure matters. […] Sociology has classified a broad range of possible links between individuals […]. The tendency to have several kinds of relationships in social networks is called multiplexity. But this phenomenon appears in many other networks: for example, two species can be connected by different strategies of predation, two computers by different cables or wireless connections, etc. We can modify a basic graph to take into account this multiplexity, e.g. by attaching specific tags to edges. […] Graph theory [also] allows us to encode in edges more complicated relationships, as when connections are not reciprocal. […] If a direction is attached to the edges, the resulting structure is a directed graph […] In these networks we have both in-degree and out-degree, measuring the number of inbound and outbound links of a node, respectively. […] in most cases, relations display a broad variation or intensity [i.e. they are not binary/dichotomous]. […] Weighted networks may arise, for example, as a result of different frequencies of interactions between individuals or entities.”

“An organism is […] the outcome of several layered networks and not only the deterministic result of the simple sequence of genes. Genomics has been joined by epigenomics, transcriptomics, proteomics, metabolomics, etc., the disciplines that study these layers, in what is commonly called the omics revolution. Networks are at the heart of this revolution. […] The brain is full of networks where various web-like structures provide the integration between specialized areas. In the cerebellum, neurons form modules that are repeated again and again: the interaction between modules is restricted to neighbours, similarly to what happens in a lattice. In other areas of the brain, we find random connections, with a more or less equal probability of connecting local, intermediate, or distant neurons. Finally, the neocortex — the region involved in many of the higher functions of mammals — combines local structures with more random, long-range connections. […] typically, food chains are not isolated, but interwoven in intricate patterns, where a species belongs to several chains at the same time. For example, a specialized species may predate on only one prey […]. If the prey becomes extinct, the population of the specialized species collapses, giving rise to a set of co-extinctions. An even more complicated case is where an omnivore species predates a certain herbivore, and both eat a certain plant. A decrease in the omnivore’s population does not imply that the plant thrives, because the herbivore would benefit from the decrease and consume even more plants. As more species are taken into account, the population dynamics can become more and more complicated. This is why a more appropriate description than ‘foodchains’ for ecosystems is the term foodwebs […]. These are networks in which nodes are species and links represent relations of predation. Links are usually directed (big fishes eat smaller ones, not the other way round). These networks provide the interchange of food, energy, and matter between species, and thus constitute the circulatory system of the biosphere.”

“In the cell, some groups of chemicals interact only with each other and with nothing else. In ecosystems, certain groups of species establish small foodwebs, without any connection to external species. In social systems, certain human groups may be totally separated from others. However, such disconnected groups, or components, are a strikingly small minority. In all networks, almost all the elements of the systems take part in one large connected structure, called a giant connected component. […] In general, the giant connected component includes not less than 90 to 95 per cent of the system in almost all networks. […] In a directed network, the existence of a path from one node to another does not guarantee that the journey can be made in the opposite direction. Wolves eat sheep, and sheep eat grass, but grass does not eat sheep, nor do sheep eat wolves. This restriction creates a complicated architecture within the giant connected component […] according to an estimate made in 1999, more than 90 per cent of the WWW is composed of pages connected to each other, if the direction of edges is ignored. However, if we take direction into account, the proportion of nodes mutually reachable is only 24 per cent, the giant strongly connected component. […] most networks are sparse, i.e. they tend to be quite frugal in connections. Take, for example, the airport network: the personal experience of every frequent traveller shows that direct flights are not that common, and intermediate stops are necessary to reach several destinations; thousands of airports are active, but each city is connected to less than 20 other cities, on average. The same happens in most networks. A measure of this is given by the mean number of connection of their nodes, that is, their average degree.”

“[A] puzzling contradiction — a sparse network can still be very well connected — […] attracted the attention of the Hungarian mathematicians […] Paul Erdős and Alfréd Rényi. They tackled it by producing different realizations of their random graph. In each of them, they changed the density of edges. They started with a very low density: less than one edge per node. It is natural to expect that, as the density increases, more and more nodes will be connected to each other. But what Erdős and Rényi found instead was a quite abrupt transition: several disconnected components coalesced suddenly into a large one, encompassing almost all the nodes. The sudden change happened at one specific critical density: when the average number of links per node (i.e. the average degree) was greater than one, then the giant connected component suddenly appeared. This result implies that networks display a very special kind of economy, intrinsic to their disordered structure: a small number of edges, even randomly distributed between nodes, is enough to generate a large structure that absorbs almost all the elements. […] Social systems seem to be very tightly connected: in a large enough group of strangers, it is not unlikely to find pairs of people with quite short chains of relations connecting them. […] The small-world property consists of the fact that the average distance between any two nodes (measured as the shortest path that connects them) is very small. Given a node in a network […], few nodes are very close to it […] and few are far from it […]: the majority are at the average — and very short — distance. This holds for all networks: starting from one specific node, almost all the nodes are at very few steps from it; the number of nodes within a certain distance increases exponentially fast with the distance. Another way of explaining the same phenomenon […] is the following: even if we add many nodes to a network, the average distance will not increase much; one has to increase the size of a network by several orders of magnitude to notice that the paths to new nodes are (just a little) longer. The small-world property is crucial to many network phenomena. […] The small-world property is something intrinsic to networks. Even the completely random Erdős-Renyi graphs show this feature. By contrast, regular grids do not display it. If the Internet was a chessboard-like lattice, the average distance between two routers would be of the order of 1,000 jumps, and the Net would be much slower [the authors note elsewhere that “The Internet is composed of hundreds of thousands of routers, but just about ten ‘jumps’ are enough to bring an information packet from one of them to any other.”] […] The key ingredient that transforms a structure of connections into a small world is the presence of a little disorder. No real network is an ordered array of elements. On the contrary, there are always connections ‘out of place’. It is precisely thanks to these connections that networks are small worlds. […] Shortcuts are responsible for the small-world property in many […] situations.”

“Body size, IQ, road speed, and other magnitudes have a characteristic scale: that is, an average value that in the large majority of cases is a rough predictor of the actual value that one will find. […] While height is a homogeneous magnitude, the number of social connection[s] is a heterogeneous one. […] A system with this feature is said to be scale-free or scale-invariant, in the sense that it does not have a characteristic scale. This can be rephrased by saying that the individual fluctuations with respect to the average are too large for us to make a correct prediction. […] In general, a network with heterogeneous connectivity has a set of clear hubs. When a graph is small, it is easy to find whether its connectivity is homogeneous or heterogeneous […]. In the first case, all the nodes have more or less the same connectivity, while in the latter it is easy to spot a few hubs. But when the network to be studied is very big […] things are not so easy. […] the distribution of the connectivity of the nodes of the […] network […] is the degree distribution of the graph. […] In homogeneous networks, the degree distribution is a bell curve […] while in heterogeneous networks, it is a power law […]. The power law implies that there are many more hubs (and much more connected) in heterogeneous networks than in homogeneous ones. Moreover, hubs are not isolated exceptions: there is a full hierarchy of nodes, each of them being a hub compared with the less connected ones.”

“Looking at the degree distribution is the best way to check if a network is heterogeneous or not: if the distribution is fat tailed, then the network will have hubs and heterogeneity. A mathematically perfect power law is never found, because this would imply the existence of hubs with an infinite number of connections. […] Nonetheless, a strongly skewed, fat-tailed distribution is a clear signal of heterogeneity, even if it is never a perfect power law. […] While the small-world property is something intrinsic to networked structures, hubs are not present in all kind of networks. For example, power grids usually have very few of them. […] hubs are not present in random networks. A consequence of this is that, while random networks are small worlds, heterogeneous ones are ultra-small worlds. That is, the distance between their vertices is relatively smaller than in their random counterparts. […] Heterogeneity is not equivalent to randomness. On the contrary, it can be the signature of a hidden order, not imposed by a top-down project, but generated by the elements of the system. The presence of this feature in widely different networks suggests that some common underlying mechanism may be at work in many of them. […] the Barabási–Albert model gives an important take-home message. A simple, local behaviour, iterated through many interactions, can give rise to complex structures. This arises without any overall blueprint”.

Homogamy, the tendency of like to marry like, is very strong […] Homogamy is a specific instance of homophily: this consists of a general trend of like to link to like, and is a powerful force in shaping social networks […] assortative mixing [is] a special form of homophily, in which nodes tend to connect with others that are similar to them in the number of connections. By contrast [when] high- and low-degree nodes are more connected to each other [it] is called disassortative mixing. Both cases display a form of correlation in the degrees of neighbouring nodes. When the degrees of neighbours are positively correlated, then the mixing is assortative; when negatively, it is disassortative. […] In random graphs, the neighbours of a given node are chosen completely at random: as a result, there is no clear correlation between the degrees of neighbouring nodes […]. On the contrary, correlations are present in most real-world networks. Although there is no general rule, most natural and technological networks tend to be disassortative, while social networks tend to be assortative. […] Degree assortativity and disassortativity are just an example of the broad range of possible correlations that bias how nodes tie to each other.”

“[N]etworks (neither ordered lattices nor random graphs), can have both large clustering and small average distance at the same time. […] in almost all networks, the clustering of a node depends on the degree of that node. Often, the larger the degree, the smaller the clustering coefficient. Small-degree nodes tend to belong to well-interconnected local communities. Similarly, hubs connect with many nodes that are not directly interconnected. […] Central nodes usually act as bridges or bottlenecks […]. For this reason, centrality is an estimate of the load handled by a node of a network, assuming that most of the traffic passes through the shortest paths (this is not always the case, but it is a good approximation). For the same reason, damaging central nodes […] can impair radically the flow of a network. Depending on the process one wants to study, other definitions of centrality can be introduced. For example, closeness centrality computes the distance of a node to all others, and reach centrality factors in the portion of all nodes that can be reached in one step, two steps, three steps, and so on.”

“Domino effects are not uncommon in foodwebs. Networks in general provide the backdrop for large-scale, sudden, and surprising dynamics. […] most of the real-world networks show a doubled-edged kind of robustness. They are able to function normally even when a large fraction of the network is damaged, but suddenly certain small failures, or targeted attacks, bring them down completely. […] networks are very different from engineered systems. In an airplane, damaging one element is enough to stop the whole machine. In order to make it more resilient, we have to use strategies such as duplicating certain pieces of the plane: this makes it almost 100 per cent safe. In contrast, networks, which are mostly not blueprinted, display a natural resilience to a broad range of errors, but when certain elements fail, they collapse. […] A random graph of the size of most real-world networks is destroyed after the removal of half of the nodes. On the other hand, when the same procedure is performed on a heterogeneous network (either a map of a real network or a scale-free model of a similar size), the giant connected component resists even after removing more than 80 per cent of the nodes, and the distance within it is practically the same as at the beginning. The scene is different when researchers simulate a targeted attack […] In this situation the collapse happens much faster […]. However, now the most vulnerable is the second: while in the homogeneous network it is necessary to remove about one-fifth of its more connected nodes to destroy it, in the heterogeneous one this happens after removing the first few hubs. Highly connected nodes seem to play a crucial role, in both errors and attacks. […] hubs are mainly responsible for the overall cohesion of the graph, and removing a few of them is enough to destroy it.”

“Studies of errors and attacks have shown that hubs keep different parts of a network connected. This implies that they also act as bridges for spreading diseases. Their numerous ties put them in contact with both infected and healthy individuals: so hubs become easily infected, and they infect other nodes easily. […] The vulnerability of heterogeneous networks to epidemics is bad news, but understanding it can provide good ideas for containing diseases. […] if we can immunize just a fraction, it is not a good idea to choose people at random. Most of the times, choosing at random implies selecting individuals with a relatively low number of connections. Even if they block the disease from spreading in their surroundings, hubs will always be there to put it back into circulation. A much better strategy would be to target hubs. Immunizing hubs is like deleting them from the network, and the studies on targeted attacks show that eliminating a small fraction of hubs fragments the network: thus, the disease will be confined to a few isolated components. […] in the epidemic spread of sexually transmitted diseases the timing of the links is crucial. Establishing an unprotected link with a person before they establish an unprotected link with another person who is infected is not the same as doing so afterwards.”

April 3, 2018 Posted by | Biology, Books, Ecology, Engineering, Epidemiology, Genetics, Mathematics, Statistics | Leave a comment