Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction


Patrick Verga, Emma Strubell, Andrew McCallum
College of Information and Computer Sciences, University of Massachusetts Amherst
{pat, strubell, mccallum}@cs.umass.edu

Abstract

Most work in relation extraction forms a prediction by looking at a short span of text within a single sentence containing a single entity pair mention. This approach often does not consider interactions across mentions, requires redundant computation for each mention pair, and ignores relationships expressed across sentence boundaries. These problems are exacerbated by the document- (rather than sentence-) level annotation common in biological text. In response, we propose a model which simultaneously predicts relationships between all mention pairs in a document. We form pairwise predictions over entire paper abstracts using an efficient self-attention encoder. All-pairs mention scores allow us to perform multi-instance learning by aggregating over mentions to form entity pair representations. We further adapt to settings without mention-level annotation by jointly training to predict named entities and adding a corpus of weakly labeled data. In experiments on two Biocreative benchmark datasets, we achieve state of the art performance on the Biocreative V Chemical Disease Relation dataset for models without external KB resources. We also introduce a new dataset an order of magnitude larger than existing human-annotated biological information extraction datasets and more accurate than distantly supervised alternatives.

1 Introduction

With few exceptions (Swampillai and Stevenson, 2011; Quirk and Poon, 2017; Peng et al., 2017), nearly all work in relation extraction focuses on classifying a short span of text within a single sentence containing a single entity pair mention. However, relationships between entities are often expressed across sentence boundaries or otherwise require a larger context to disambiguate. For example, 30% of relations in the Biocreative V CDR dataset (§3.1) are expressed across sentence boundaries, such as in the following excerpt expressing a relationship between the chemical azathioprine and the disease fibrosis:

    Treatment of psoriasis with azathioprine. Azathioprine treatment benefited 19 (66%) out of 29 patients suffering from severe psoriasis. Haematological complications were not troublesome and results of biochemical liver function tests remained normal. Minimal cholestasis was seen in two cases and portal fibrosis of a reversible degree in eight. Liver biopsies should be undertaken at regular intervals if azathioprine therapy is continued so that structural liver damage may be detected at an early and reversible stage.

Though the entities' mentions never occur in the same sentence, the above example expresses that the chemical entity azathioprine can cause the side effect fibrosis. Relation extraction models which consider only within-sentence relation pairs cannot extract this fact without knowledge of the complicated coreference relationship between azathioprine treatment and eight, which, without features from a complicated pre-processing pipeline, cannot be learned by a model which considers entity pairs in isolation. Making separate predictions for each mention pair also obstructs multi-instance learning (Riedel et al., 2010; Surdeanu et al., 2012), a technique which aggregates entity representations from mentions in order to improve robustness to noise in the data. Like the majority of relation extraction data, most annotation for biological relations is distantly supervised, and so we could benefit from a model which is amenable to multi-instance learning.

In addition to this loss of cross-sentence and cross-mention reasoning capability, traditional mention pair relation extraction models typically introduce computational inefficiencies by independently extracting features for and scoring every pair of mentions, even when those mentions occur in the same sentence and thus could share representations. In the CDR training set, this requires separately encoding and classifying each of the 5,318 candidate mention pairs independently, versus encoding each of the 500 abstracts once.

Though abstracts are longer than e.g. the text between mentions, many sentences contain multiple mentions, leading to redundant computation.

However, encoding long sequences in a way which effectively incorporates long-distance context can be prohibitively expensive. Long Short Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are among the most popular token encoders due to their capacity to learn high-quality representations of text, but their ability to leverage the fastest computing hardware is thwarted by their computational dependence on the length of the sequence: each token's representation requires as input the representation of the previous token, limiting the extent to which computation can be parallelized. Convolutional neural networks (CNNs), in contrast, can be executed entirely in parallel across the sequence, but the amount of context incorporated into a single token's representation is limited by the depth of the network, and very deep networks can be difficult to learn (Hochreiter, 1998). These problems are exacerbated by longer sequences, limiting the extent to which previous work explored full-abstract relation extraction.

To facilitate efficient full-abstract relation extraction from biological text, we propose Bi-affine Relation Attention Networks (BRANs), a combination of network architecture, multi-instance and multi-task learning designed to extract relations between entities in biological text without requiring explicit mention-level annotation. We synthesize convolutions and self-attention, a modification of the Transformer encoder introduced by Vaswani et al. (2017), over sub-word tokens to efficiently incorporate rich context between distant mention pairs across the entire abstract into token representations. We score all pairs of mentions in parallel using a bi-affine operator, and aggregate over mention pairs using a soft approximation of the max function in order to perform multi-instance learning. We jointly train the model to predict relations and entities, further improving robustness to noise and lack of gold annotation at the mention level.

In extensive experiments on two benchmark biological relation extraction datasets, we achieve state of the art performance for a model using no external knowledge base resources on the Biocreative V CDR dataset, and outperform comparable baselines on the Biocreative VI ChemProt dataset. We also introduce a new dataset which is an order of magnitude larger than existing gold-annotated biological relation extraction datasets while covering a wider range of entity and relation types and with higher accuracy than distantly supervised datasets of the same size. We provide a strong baseline on this new dataset, and encourage its use as a benchmark for future biological relation extraction systems. (Footnote 1: Our code and data are publicly available at: https://github.com/patverga/bran)

2 Model

We designed our model to efficiently encode long contexts spanning multiple sentences while forming pairwise predictions without the need for mention pair-specific features. To do this, our model first encodes input token embeddings using self-attention. These embeddings are used to predict both entities and relations. The relation extraction module converts each token to a head and tail representation. These representations are used to form mention pair predictions using a bi-affine operation with respect to learned relation embeddings. Finally, these mention pair predictions are pooled to form entity pair predictions, expressing whether each relation type is expressed by each entity pair.

2.1 Inputs

Our model takes in a sequence of N token embeddings in R^d. Because the Transformer has no innate notion of token position, the model relies on positional embeddings which are added to the input token embeddings. (Footnote 2: Though our final model incorporates some convolutions, we retain the position embeddings.) We learn the position embedding matrix P ∈ R^{m×d}, which contains a separate d-dimensional embedding for each position, limited to m possible positions. Our final input representation for token i is:

    x_i = s_i + p_i

where s_i is the token embedding for token i and p_i is the positional embedding for the i-th position. If i exceeds m, we use a randomly initialized vector in place of p_i.
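To make this input layer concrete, the following is a minimal NumPy sketch of the representation described above. The array names and the single fixed random vector used for positions beyond m are our illustrative choices, not the released implementation.

    import numpy as np

    def embed_inputs(token_ids, token_emb, pos_emb, rng):
        # token_ids: length-N sequence of byte-pair token ids
        # token_emb: (V, d) learned token embedding matrix (rows s_i)
        # pos_emb:   (m, d) learned position embedding matrix (rows p_i)
        m, d = pos_emb.shape
        overflow = rng.standard_normal(d)      # used in place of p_i when i >= m
        x = np.zeros((len(token_ids), d))
        for i, t in enumerate(token_ids):
            p = pos_emb[i] if i < m else overflow
            x[i] = token_emb[t] + p            # x_i = s_i + p_i
        return x

    # usage: x = embed_inputs([3, 17, 5], token_emb, pos_emb, np.random.default_rng(0))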
We tokenize the text using byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015). The BPE algorithm constructs a vocabulary of sub-word pieces, beginning with single characters. The algorithm then iteratively merges the most frequent co-occurring tokens into a new token, which is added to the vocabulary. This procedure continues until a pre-defined vocabulary size is met.

BPE is well suited for biological data for the following reasons. First, biological entities often have unique mentions made up of meaningful subcomponents, such as 1,2-dimethylhydrazine. Additionally, tokenization of chemical entities is challenging, lacking a universally agreed upon algorithm (Krallinger et al., 2015). As we demonstrate in §3.3.2, the sub-word representations produced by BPE allow the model to formulate better predictions, likely due to better modeling of rare and unknown words.
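The merge loop described above can be sketched as follows. This is an illustrative, unoptimized rendering of the BPE procedure of Gage (1994) and Sennrich et al. (2015); the function and variable names are ours, and the released code may construct its vocabulary differently.

    from collections import Counter

    def learn_bpe(corpus_words, vocab_budget):
        # Represent each word as a tuple of symbols, starting from single characters.
        words = Counter(tuple(w) for w in corpus_words)
        vocab = {c for w in words for c in w}
        while len(vocab) < vocab_budget:
            # Count adjacent symbol pairs across the corpus.
            pairs = Counter()
            for w, freq in words.items():
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]   # most frequent co-occurring pair
            merged = a + b
            vocab.add(merged)
            # Re-tokenize every word with the new merged symbol.
            new_words = Counter()
            for w, freq in words.items():
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                        out.append(merged); i += 2
                    else:
                        out.append(w[i]); i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return vocab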

2.2 Transformer

We base our token encoder on the Transformer self-attention model (Vaswani et al., 2017). The Transformer is made up of B blocks. Each Transformer block, which we denote Transformer_k, has its own set of parameters and is made up of two subcomponents: multi-head attention and a series of convolutions. (Footnote 3: The original Transformer uses feed-forward connections, i.e. width-1 convolutions, whereas we use convolutions with width > 1.) The output for token i of block k, b_i^(k), is connected to its input with a residual connection (He et al., 2016). Starting with b_i^(0) = x_i:

    b_i^(k) = b_i^(k-1) + Transformer_k(b_i^(k-1))

[Figure 1: The relation extraction architecture. Inputs are contextually encoded using the Transformer (Vaswani et al., 2017), made up of B layers of multi-head attention and convolution subcomponents. Each transformed token is then passed through a head MLP and a tail MLP to produce two position-specific representations. A bi-affine operation is performed between each head and tail representation with respect to each relation's embedding matrix, producing a pair-wise relation affinity tensor. Finally, the scores for cells corresponding to the same entity pair are pooled with a separate LogSumExp operation for each relation to get a final score. The colored tokens illustrate calculating the score for a given pair of entities; the model is only given entity information when pooling over mentions.]

2.2.1 Multi-head Attention

Multi-head attention applies self-attention multiple times over the same inputs using separately normalized parameters (attention heads) and combines the results, as an alternative to applying one pass of attention with more parameters. The intuition behind this modeling decision is that dividing the attention into multiple heads makes it easier for the model to learn to attend to different types of relevant information with each head. The self-attention updates input b_i^(k-1) by performing a weighted sum over all tokens in the sequence, weighted by their importance for modeling token i.

Each input is projected to a query q, key k, and value v, using separate affine transformations with ReLU activations (Glorot et al., 2011). Here, q, k, and v are each in R^{d/H}, where H is the number of heads. The attention weights a_ijh for head h between tokens i and j are computed using scaled dot-product attention:

    a_ijh = σ( q_ih^T k_jh / √d )
    o_ih  = Σ_j a_ijh ⊙ v_jh

with ⊙ denoting element-wise multiplication and σ indicating a softmax along the j-th dimension. The scaled attention is meant to aid optimization by flattening the softmax and better distributing the gradients (Vaswani et al., 2017).

The outputs of the individual attention heads are concatenated, denoted [· ; ·], into o_i. All layers in the network use residual connections between the output of the multi-headed attention and its input. Layer normalization (Ba et al., 2016), denoted LN(·), is then applied to the output:

    o_i = [o_i1 ; ... ; o_iH]
    m_i = LN(b_i^(k-1) + o_i)
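A compact NumPy sketch of the multi-head attention step defined above follows. The shapes follow the equations in this subsection; projection biases, the residual connection, and layer normalization are left to the caller, and the variable names are ours rather than the released implementation's.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(B_prev, Wq, Wk, Wv, H):
        # B_prev: (N, d) block input b^(k-1); Wq, Wk, Wv: (d, d) projection weights.
        # Splitting the d columns into H heads of size d/H is equivalent to H
        # separate per-head projections, as described in Section 2.2.1.
        N, d = B_prev.shape
        dh = d // H
        q = np.maximum(B_prev @ Wq, 0).reshape(N, H, dh)   # ReLU(affine) queries
        k = np.maximum(B_prev @ Wk, 0).reshape(N, H, dh)   # keys
        v = np.maximum(B_prev @ Wv, 0).reshape(N, H, dh)   # values
        # logits[i, j, h]: scaled dot product between token i's query and token j's key
        logits = np.einsum('ihd,jhd->ijh', q, k) / np.sqrt(d)
        a = softmax(logits, axis=1)                         # softmax along the j dimension
        o = np.einsum('ijh,jhd->ihd', a, v).reshape(N, d)   # weighted sum, heads concatenated
        return o   # caller applies the residual connection and layer norm: LN(B_prev + o)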

2.2.2 Convolutions

The second part of our Transformer block is a stack of convolutional layers. The sub-network used in Vaswani et al. (2017) uses two width-1 convolutions. We add a third middle layer with kernel width 5, which we found to perform better. Many relations are expressed concisely by the immediate local context, e.g. Michele's husband Barack, or labetalol-induced hypotension. Adding this explicit n-gram modeling is meant to ease the burden on the model to learn to attend to local features. We use C_w(·) to denote a convolutional operator with kernel width w. Then the convolutional portion of the transformer block is given by:

    t_i^(0) = ReLU(C_1(m_i))
    t_i^(1) = ReLU(C_5(t_i^(0)))
    t_i^(2) = C_1(t_i^(1))

where the dimensions of t_i^(0) and t_i^(1) are in R^{4d} and that of t_i^(2) is in R^d.

2.3 Bi-affine Pairwise Scores

We project each contextually encoded token b_i^(B) through two separate MLPs to generate two new versions of each token corresponding to whether it will serve as the first (head) or second (tail) argument of a relation:

    e_i^head = W_head^(1) (ReLU(W_head^(0) b_i^(B)))
    e_i^tail = W_tail^(1) (ReLU(W_tail^(0) b_i^(B)))

We use a bi-affine operator to calculate an N × L × N tensor A of pairwise affinity scores, scoring each (head, relation, tail) triple:

    A_ilj = (e_i^head L) e_j^tail

where L is a d × L × d tensor, a learned embedding matrix for each of the L relations. In subsequent sections we will assume we have transposed the dimensions of A as N × N × L for ease of indexing.

2.4 Entity Level Prediction

Our data is weakly labeled in that there are labels at the entity level but not the mention level, making the problem a form of strong-distant supervision (Mintz et al., 2009). In distant supervision, edges in a knowledge graph are heuristically applied to sentences in an auxiliary unstructured text corpus, often applying the edge label to all sentences containing the subject and object of the relation. Because this process is imprecise and introduces noise into the training data, methods like multi-instance learning were introduced (Riedel et al., 2010; Surdeanu et al., 2012). In multi-instance learning, rather than looking at each distantly labeled mention pair in isolation, the model is trained over the aggregate of these mentions and a single update is made. More recently, the weighting function of the instances has been expressed as neural network attention (Verga and McCallum, 2016; Lin et al., 2016; Yaghoobzadeh et al., 2017).

We aggregate over all representations for each mention pair in order to produce per-relation scores for each entity pair. For each entity pair (p^head, p^tail), let P^head denote the set of indices of mentions of the entity p^head, and let P^tail denote the indices of mentions of the entity p^tail. Then we use the LogSumExp function to aggregate the relation scores from A across all pairs of mentions of p^head and p^tail:

    scores(p^head, p^tail) = log Σ_{i ∈ P^head, j ∈ P^tail} exp(A_ij)

The LogSumExp scoring function is a smooth approximation to the max function and has the benefits of aggregating information from multiple predictions and propagating dense gradients, as opposed to the sparse gradient updates of the max (Das et al., 2017).
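The head/tail projections, bi-affine scoring, and LogSumExp pooling of §2.3 and §2.4 can be sketched as follows. This is an illustrative NumPy rendering under the definitions above (MLP biases omitted); it is not the authors' TensorFlow implementation.

    import numpy as np

    def pair_scores(B_final, W_head, W_tail, L_tensor):
        # B_final: (N, d) output of the final Transformer block b^(B).
        # W_head, W_tail: two-element lists of (d, d) MLP weights.
        # L_tensor: (d, L, d) learned relation embedding tensor.
        e_head = np.maximum(B_final @ W_head[0], 0) @ W_head[1]
        e_tail = np.maximum(B_final @ W_tail[0], 0) @ W_tail[1]
        # A[i, j, l] = e_head[i] . L[:, l, :] . e_tail[j]   (transposed to N x N x L)
        return np.einsum('id,dlt,jt->ijl', e_head, L_tensor, e_tail)

    def entity_pair_scores(A, head_mentions, tail_mentions):
        # LogSumExp pooling over all mention pairs of one entity pair (Section 2.4).
        # head_mentions / tail_mentions: token indices of the two entities' mentions.
        sub = A[np.ix_(head_mentions, tail_mentions)]       # (|P_head|, |P_tail|, L)
        sub = sub.reshape(-1, sub.shape[-1])
        m = sub.max(axis=0)                                  # numerically stable logsumexp
        return m + np.log(np.exp(sub - m).sum(axis=0))       # (L,) per-relation scores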
2.5 Named Entity Recognition

In addition to pairwise relation predictions, we use the Transformer output b_i^(B) to make entity type predictions. We feed b_i^(B) as input to a linear classifier which predicts the entity label for each token with per-class scores c_i:

    c_i = W^(3) b_i^(B)

We augment the entity type labels with the BIO encoding to denote entity spans. We apply the tags to the byte-pair tokenization by treating each sub-word within a mention span as an additional token with a corresponding B- or I- label.

2.6 Training

We train both the NER and relation extraction components of our network to perform multi-class classification using maximum likelihood, where NER classes y_i or relation classes r_i are conditionally independent given deep features produced by our model, with probabilities given by the softmax function. In the case of NER, features are given by the per-token output of the Transformer:

    (1/N) Σ_{i=1}^{N} log P(y_i | b_i^(B))

In the case of relation extraction, the features for each entity pair are given by the LogSumExp over pairwise scores described in §2.4. For E entity pairs, the relation r_i is given by:

    (1/E) Σ_{i=1}^{E} log P(r_i | scores(p_i^head, p_i^tail))

We train the NER and relation objectives jointly, sharing all embeddings and Transformer parameters. To trade off the two objectives, we penalize the named entity updates with a hyperparameter λ.
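A minimal sketch of this joint objective follows, assuming a λ-weighted combination of the two negative log-likelihoods as described above; the exact way the released code balances the two updates may differ.

    import numpy as np

    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    def joint_loss(ner_logits, ner_labels, rel_scores, rel_labels, lam):
        # ner_logits: (N, C) per-token class scores c_i; ner_labels: (N,) gold BIO tags
        # rel_scores: (E, L) LogSumExp-pooled entity pair scores; rel_labels: (E,) gold relations
        # lam: hyperparameter that down-weights the named entity updates
        ner_ll = log_softmax(ner_logits)[np.arange(len(ner_labels)), ner_labels].mean()
        rel_ll = log_softmax(rel_scores)[np.arange(len(rel_labels)), rel_labels].mean()
        return -(rel_ll + lam * ner_ll)   # negative joint log-likelihood to minimize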

3 Results

We evaluate our model on three datasets: the Biocreative V Chemical Disease Relation benchmark (CDR), which models relations between chemicals and diseases (§3.1); the Biocreative VI ChemProt benchmark (CPR), which models relations between chemicals and proteins (§3.2); and a new, large and accurate dataset we describe in §3.3 based on the human curation in the Comparative Toxicogenomics Database (CTD), which models relationships between chemicals, proteins and genes.

The CDR dataset is annotated at the level of paper abstracts, requiring consideration of long-range, cross-sentence relationships; evaluation on this dataset therefore demonstrates that our model is capable of such reasoning. We also evaluate our model's performance in the more traditional setting which does not require cross-sentence modeling by performing experiments on the CPR dataset, for which all annotations are between two entity mentions in a single sentence. Finally, we present a new dataset constructed using strong-distant supervision (§2.4), with annotations at the document level. This dataset is significantly larger than the others, contains more relation types, and requires reasoning across sentences.

3.1 Chemical Disease Relations Dataset

The Biocreative V chemical disease relation extraction (CDR) dataset (Li et al., 2016a; Wei et al., 2016) was derived from the Comparative Toxicogenomics Database (CTD), which curates interactions between genes, chemicals, and diseases (Davis et al., 2008). (Footnote 4: http://www.biocreative.org/) CTD annotations are only at the document level and do not contain mention annotations. The CDR dataset is a subset of these original annotations, supplemented with human annotated, entity linked mention annotations. The relation annotations in this dataset are also at the document level only.

3.1.1 Data Preprocessing

The CDR dataset is concerned with extracting only chemically-induced disease relationships (drug-related side effects and adverse reactions) concerning the most specific entity in the document. For example, tobacco causes cancer could be marked as false if the document contained the more specific lung cancer. This can cause true relations to be labeled as false, harming evaluation performance. To address this we follow Gu et al. (2016, 2017) and filter hypernyms according to the hierarchy in the MESH controlled vocabulary. (Footnote 5: https://www.nlm.nih.gov/mesh/download/2017MeshTree.txt) All entity pairs within the same abstract that do not have an annotated relation are assigned the NULL label.

In addition to the gold CDR data, Peng et al. (2016) add 15,448 PubMed abstracts annotated in the CTD dataset. We consider this same set of abstracts as additional training data (which we subsequently denote +Data). Since this data does not contain entity annotations, we take the annotations from Pubtator (Wei et al., 2013), a state of the art biological named entity tagger and entity linker. See §A.1 for additional data processing details. In our experiments we only evaluate relation extraction performance, and all models (including baselines) use gold entity annotations for predictions.

The byte pair vocabulary is generated over the training dataset; we use a budget of 2,500 tokens when training on the gold CDR data, and a larger budget of 10,000 tokens when including the extra data described above. Additional implementation details are included in Appendix A.

Table 1: Data statistics for the CDR dataset and additional data from CTD, showing the total number of abstracts, positive examples, and negative examples for each of the data splits.

    Data split    Docs    Pos     Neg
    Train         500     1,038   4,280
    Development   500     1,012   4,136
    Test          500     1,066   4,270
    CTD           15,448  26,657  146,057
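The following sketch illustrates the candidate-pair labeling described in §3.1.1: annotated pairs become positive examples, hypernyms of more specific annotated diseases are filtered following Gu et al. (2016, 2017), and everything else is assigned NULL. The helper is_ancestor over the MeSH hierarchy, the label names, and the filtering condition are hypothetical renderings of the prose above, not the released preprocessing script.

    def label_candidate_pairs(chemicals, diseases, gold_relations, is_ancestor):
        # chemicals, diseases: linked entity ids found in one abstract
        # gold_relations: set of (chemical_id, disease_id) pairs annotated for this abstract
        # is_ancestor(a, b): hypothetical MeSH-hierarchy helper, True if a is a hypernym of b
        examples = []
        for c in chemicals:
            for d in diseases:
                if (c, d) in gold_relations:
                    examples.append((c, d, "CID"))     # chemically-induced disease
                elif any(c == c2 and is_ancestor(d, d2)
                         for (c2, d2) in gold_relations):
                    continue                            # hypernym of a more specific annotated disease: filtered
                else:
                    examples.append((c, d, "NULL"))
        return examples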
3.1.2 Baselines

We compare against the previous best reported results on this dataset that do not use knowledge base features. (Footnote 6: The highest reported score is from Peng et al. (2016), but they use explicit lookups into the CTD knowledge base for the existence of the test entity pair.) Each of the baselines is an ensemble method for within- and cross-sentence relations that makes use of additional linguistic features (syntactic parse and part-of-speech). Gu et al. (2017) encode mention pairs using a CNN while Zhou et al. (2016a) use an LSTM. Both make cross-sentence predictions with featurized classifiers.

3.1.3 Results

In Table 2 we show results outperforming the baselines despite using no linguistic features. We show performance averaged over 20 runs with 20 random seeds as well as an ensemble of their averaged predictions. We see a further boost in performance by adding weakly labeled data.

Table 2: Precision, recall, and F1 results on the Biocreative V CDR dataset.

    Model                P     R     F1
    Gu et al. (2016)     62.0  55.1  58.3
    Zhou et al. (2016a)  55.6  68.4  61.3
    Gu et al. (2017)     55.7  68.1  61.3
    BRAN                 55.6  70.8  62.1 ± 0.8
      + Data             64.0  69.2  66.2 ± 0.8
    BRAN (ensemble)      63.3  67.1  65.1
      + Data             65.4  71.8  68.4

Table 3 shows the effects of ablating pieces of our model. 'CNN only' removes the multi-head attention component from the transformer block, 'no width-5' replaces the width-5 convolution of the feed-forward component of the transformer with a width-1 convolution, and 'no NER' removes the named entity recognition multi-task objective (§2.5).

Table 3: Precision, recall, and F1 on the Biocreative V CDR dataset for various model ablations.

    Model         P     R     F1
    BRAN (Full)   55.6  70.8  62.1 ± 0.8
    – CNN only    43.9  65.5  52.4 ± 1.3
    – no width-5  48.2  67.2  55.7 ± 0.9
    – no NER      49.9  63.8  55.5 ± 1.8

3.2 Chemical Protein Relations Dataset

To assess our model's performance in settings where cross-sentence relationships are not explicitly evaluated, we perform experiments on the Biocreative VI ChemProt dataset (CPR) (Krallinger et al., 2017). This dataset is concerned with classifying into six relation types between chemicals and proteins, with nearly all annotated relationships occurring within the same sentence.

3.2.1 Baselines

We compare our models against those competing in the official Biocreative VI competition (Liu et al., 2017). We compare to the top performing team whose model is directly comparable with ours, i.e. one that used a single (non-ensemble) model trained only on the training data (many teams use the development set as additional training data). The baseline models are standard state of the art relation extraction models: CNNs and Gated RNNs with attention. Each of these baselines uses mention-specific features encoding the relative position of each token to the two target entities being classified, whereas our model aggregates over all mention pairs in each sentence. It is also worth noting that these models use a large vocabulary of pre-trained word embeddings, giving them the advantage of far more model parameters, as well as additional information from unsupervised pre-training.

3.2.2 Results

In Table 4 we see that even though our model forms all predictions simultaneously between all pairs of entities within the sentence, we are able to outperform state of the art models classifying each mention pair independently. The scores shown are averaged across 10 runs with 10 random seeds. Interestingly, our model appears to have higher recall and lower precision, while the baseline models are both precision-biased, with lower recall. This suggests that combining these styles of model could lead to further gains on this task.

Table 4: Precision, recall, and F1 results on the Biocreative VI ChemProt dataset. † denotes results from Liu et al. (2017).

    Model            P     R     F1
    CNN †            50.7  43.0  46.5
    GRU+Attention †  53.0  46.3  49.5
    BRAN             48.0  54.1  50.8 ± .01
3.3 New CTD Dataset

3.3.1 Data

Existing biological relation extraction datasets, including both CDR (§3.1) and CPR (§3.2), are relatively small, typically consisting of hundreds or a few thousand annotated examples. Distant supervision datasets apply document-independent, entity-level annotations to all sentences, leading to a large proportion of incorrect labels. Evaluations on this data involve either very small (a few hundred) gold annotated examples or cross validation to predict the noisy, distantly applied labels (Mallory et al., 2015; Quirk and Poon, 2017; Peng et al., 2017).

We address these issues by constructing a new dataset using strong-distant supervision containing document-level annotations. The Comparative Toxicogenomics Database (CTD) curates interactions between genes, chemicals, and diseases. Each relation in the CTD is associated with a disambiguated entity pair and a PubMed article where the relation was observed.

To construct this dataset, we collect the abstracts for each of the PubMed articles with at least one curated relation in the CTD database. As in §3.1, we use PubTator to automatically tag and disambiguate the entities in each of these abstracts. If both entities in the relation are found in the abstract, we take the (abstract, relation) pair as a positive example. The evidence for the curated relation could occur anywhere in the full text article, not just the abstract. Abstracts with no recovered relations are discarded.

All other entity pairs with valid types and without an annotated relation that occur in the remaining abstracts are considered negative examples and assigned the NULL label. We additionally remove abstracts containing greater than 500 tokens. (Footnote 7: We include scripts to generate the unfiltered set of data as well, to encourage future research.) This limit removed about 10% of the total data, including numerous extremely long abstracts. The average token length of the remaining data was 230 tokens. With this procedure, we are able to collect 166,474 positive examples over 13 relation types, with more detailed statistics of the dataset listed in Table 5.

Table 5: Data statistics for the new CTD dataset.

    Types             Docs    Pos      Neg
    Total             68,400  166,474  1,198,493
    Chemical/Disease  64,139  93,940   571,932
    Chemical/Gene     34,883  63,463   360,100
    Gene/Disease      32,286  9,071    266,461

We consider relations between chemical-disease, chemical-gene, and gene-disease entity pairs downloaded from CTD. (Footnote 8: http://ctdbase.org/downloads/) We remove inferred relations (those without an associated PubMed ID) and consider only human curated relationships. Some chemical-gene entity pairs were associated with multiple relation types in the same document. We consider each of these relation types as a separate positive example.

The chemical-gene relation data contains over 100 types organized in a shallow hierarchy. Many of these types are extremely infrequent, so we map all relations to the highest parent in the hierarchy, resulting in 13 relation types. Most of these chemical-gene relations have an increase and decrease version, such as increase_expression and decrease_expression. In some cases, there is also an affects relation (affects_expression) which is used when the directionality is unknown. If the affects version is more common, we map decrease and increase to affects. If affects is less common, we drop the affects examples and keep the increase and decrease examples as distinct relations, resulting in the final set of 10 chemical-gene relation types (a sketch of this mapping follows below).
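The following is an illustrative sketch of that mapping rule. The helper names, the counting scheme, and the comparison between the affects type and its directional counterparts are our reading of the description above rather than the exact procedure used to build the released dataset.

    def map_relation_type(rel, counts, top_parent):
        # rel: a raw chemical-gene relation type, e.g. 'increases_expression'
        # top_parent: maps each type to its highest parent in the shallow CTD hierarchy
        # counts: frequency of each mapped (top-level) type
        # Returns the final label, or None if the example should be dropped.
        rel = top_parent[rel]                              # collapse to the top of the hierarchy
        _, _, base = rel.partition('_')                    # e.g. 'expression'
        affects = 'affects_' + base
        directional = counts.get('increases_' + base, 0) + counts.get('decreases_' + base, 0)
        if counts.get(affects, 0) >= directional:
            return affects                                 # affects is more common: merge into it
        if rel == affects:
            return None                                    # affects is rarer: drop those examples
        return rel                                         # keep increase/decrease as distinct types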
We see 3.3.2 Results that the model is not simply relying on short range In Table 7 we list precision, recall and F1 achieved relationships, but is leveraging information about by our model on the CTD dataset, both overall and distant entity pairs, with accuracy increasing as the by relation type. Our model predicts each of the maximum distance considered increases. Note that relation types effectively, with higher performance all results are taken from the same model trained on relations with more support. on the full unfiltered training set. In Table 8 we see that our sub-word BPE model out-performs the model using the Genia tokenizer (Kulick et al., 2012) even though our vocabulary 4 Related work size is one-fifth as large. We see a 1.7 F1 point Relation extraction is a heavily studied area in the boost in predicting Pubtator NER labels for BPE. NLP community. Most work focuses on news and This could be explained by the increased out-of- web data (Doddington et al., 2004; Riedel et al., 7 9 We include scripts to generate the unfiltered set of 2010; Hendrickx et al., 2009). Recent neural net- data as well to encourage future research 9 8 And TAC KBP: https://tac.nist.gov http://ctdbase.org/downloads/ 878

[Figure 2: Performance on the CTD dataset when restricting candidate entity pairs by distance. The x-axis shows the coarse-grained relation type (all, gene_disease, chem_disease, chem_gene); the y-axis shows F1 score. Different colors denote maximum distance cutoffs (11, 25, 50, 100, 500).]

Table 7: BRAN precision, recall and F1 results for the full CTD dataset by relation type. The model is optimized for micro F1 score across all types.

    Relation type           P     R     F1
    Total (micro F1)        44.8  50.2  47.3
    Total (macro F1)        34.0  29.8  31.7
    Chemical/Disease
      marker/mechanism      46.2  57.9  51.3
      therapeutic           55.7  67.1  60.8
    Gene/Disease
      marker/mechanism      42.2  44.4  43.0
      therapeutic           52.6  10.1  15.8
    Chemical/Gene
      increases_expression  39.7  48.0  43.3
      increases_MP          26.3  35.5  29.9
      decreases_expression  34.4  32.9  33.4
      increases_activity    24.5  24.7  24.4
      affects_response      40.9  35.5  37.4
      decreases_activity    30.8  19.4  23.5
      affects_transport     28.7  23.8  25.8
      increases_reaction    12.8  5.6   7.4
      decreases_reaction    12.3  5.7   7.4
      decreases_MP          28.9  7.0   11.0

Table 8: Precision, recall, and F1 results for CTD named entity recognition and relation extraction, comparing BPE to word-level tokenization.

    Model                P     R     F1
    Relation extraction
      Words              44.9  48.8  46.7 ± 0.39
      BPE                44.8  50.2  47.3 ± 0.19
    NER
      Words              91.0  90.7  90.9 ± 0.13
      BPE                91.5  93.6  92.6 ± 0.12

4 Related Work

Relation extraction is a heavily studied area in the NLP community. Most work focuses on news and web data (Doddington et al., 2004; Riedel et al., 2010; Hendrickx et al., 2009). (Footnote 9: And TAC KBP: https://tac.nist.gov) Recent neural network approaches to relation extraction have focused on CNNs (dos Santos et al., 2015; Zeng et al., 2015) or LSTMs (Miwa and Bansal, 2016; Verga et al., 2016a; Zhou et al., 2016b) and on replacing stage-wise information extraction pipelines with a single end-to-end model (Miwa and Bansal, 2016; Ammar et al., 2017; Li et al., 2017). These models all consider mention pairs separately.

There is also a considerable body of work specifically geared towards supervised biological relation extraction, including protein-protein (Pyysalo et al., 2007; Poon et al., 2014; Mallory et al., 2015), drug-drug (Segura-Bedmar et al., 2013), and chemical-disease (Gurulingappa et al., 2012; Li et al., 2016a) interactions, and more complex events (Kim et al., 2008; Riedel et al., 2011). Our work focuses on modeling relations between chemicals, diseases, genes and proteins, where the available annotation is often at the document or abstract level, rather than the sentence level.

Some previous work exists on cross-sentence relation extraction. Swampillai and Stevenson (2011) and Quirk and Poon (2017) consider featurized classifiers over cross-sentence syntactic parses. Most similar to our work is that of Peng et al. (2017), which uses a variant of an LSTM to encode document-level syntactic parse trees. Our work differs in three key ways. First, we operate over raw tokens, negating the need for part-of-speech or syntactic parse features which can lead to cascading errors. We also use a feed-forward neural architecture which encodes long sequences far more efficiently compared to the graph LSTM network of Peng et al. (2017). Finally, our model considers all mention pairs simultaneously rather than a single mention pair at a time.

We employ a bi-affine function to form pairwise predictions between mentions. Such models have also been used for knowledge graph link prediction (Nickel et al., 2011; Li et al., 2016b), with variations such as restricting the bilinear relation matrix to be diagonal (Yang et al., 2015) or diagonal and complex (Trouillon et al., 2016). Our model is similar to recent approaches to graph-based dependency parsing, where bilinear parameters are used to score head-dependent compatibility (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017).

5 Conclusion

We present a bi-affine relation attention network that simultaneously scores all mention pairs within a document. Our model performs well on three datasets, including two standard benchmark biological relation extraction datasets and a new, large and high-quality dataset introduced in this work. Our model out-performs the previous state of the art on the Biocreative V CDR dataset despite using no additional linguistic resources or mention pair-specific features.

Our current model predicts only into a fixed schema of relations given by the data. However, this could be ameliorated by integrating our model into open relation extraction architectures such as Universal Schema (Riedel et al., 2013; Verga et al., 2016b). Our model also lends itself to other pairwise scoring tasks such as hypernym prediction, co-reference resolution, and entity resolution. We will investigate these directions in future work.

Acknowledgments

We thank Ofer Shai and the Chan Zuckerberg Initiative / Meta data science team for helpful discussions. We also thank Timothy Dozat and Kyubyong Park for releasing their code.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. http://tensorflow.org/

Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power. 2017. The AI2 system at SemEval-2017 task 10 (ScienceIE): semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2017. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), Valencia, Spain, pages 132-141.

Allan Peter Davis, Cynthia G. Murphy, Cynthia A. Saraceni-Richards, Michael C. Rosenstein, Thomas C. Wiegers, and Carolyn J. Mattingly. 2008. Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Research 37(suppl_1):D786-D792.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The Automatic Content Extraction (ACE) program: tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation.

Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pages 626-634.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12(2):23-38.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315-323.

Jinghang Gu, Longhua Qian, and Guodong Zhou. 2016. Chemical-induced disease relation extraction with various linguistic features. Database 2016.

Jinghang Gu, Fuqing Sun, Longhua Qian, and Guodong Zhou. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database 2017.

Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics 45(5):885-892.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94-99.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(2):107-116.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 9(1):10.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, San Diego, California, USA.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313-327.

Martin Krallinger, Obdulia Rabal, Saber A. Akhondi, Martín Pérez Pérez, Jesús Santamaría, Gael Pérez Rodríguez, Georgios Tsatsaronis, Ander Intxaurrondo, José Antonio López, Umesh Nandal, Erin Van Buel, Akileshwari Chandrasekhar, Marleen Rodenburg, Astrid Laegreid, Marius Doornenbal, Julen Oyarzabal, Analia Lourenço, and Alfonso Valencia. 2017. Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the BioCreative VI Workshop, page 140.

Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M. Lowe, et al. 2015. The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics 7(S1):S2.

Seth Kulick, Ann Bies, Mark Liberman, Mark Mandel, Scott Winters, and Pete White. 2012. Integrated annotation for biomedical information extraction. In HLT/NAACL Workshop: Biolink.

Fei Li, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics 18(1):198.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016a. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016.

Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016b. Commonsense knowledge base completion. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pages 1445-1455.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pages 2124-2133.

Sijia Liu, Feichen Shen, Yanshan Wang, Majid Rastegar-Mojarad, Ravikumar Komandur Elayavilli, Vipin Chaundary, and Hongfang Liu. 2017. Attention-based neural networks for chemical protein relation extraction. In Proceedings of the BioCreative VI Workshop.

Emily K. Mallory, Ce Zhang, Christopher Ré, and Russ B. Altman. 2015. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics 32(1):106-113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pages 1003-1011.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pages 1105-1116.

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, Washington, USA, pages 809-816.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5:101-115.

Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of Cheminformatics 8(1):53.

Hoifung Poon, Kristina Toutanova, and Chris Quirk. 2014. Distant supervision for cancer pathway extraction from text. In Pacific Symposium on Biocomputing, pages 120-131.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 8(1):50.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), Valencia, Spain, pages 1171-1182.

Sebastian Riedel, David McClosky, Mihai Surdeanu, Andrew McCallum, and Christopher D. Manning. 2011. Model combination for event extraction in BioNLP 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, Portland, Oregon, USA, pages 51-55.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148-163.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74-84.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pages 341-350.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929-1958.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pages 455-465.

Kumutha Swampillai and Mark Stevenson. 2011. Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, Hissar, Bulgaria, pages 25-32.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071-2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016a. Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pages 886-896.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016b. Multilingual relation extraction using compositional universal schema. In Proceedings of NAACL-HLT, pages 886-896.

Patrick Verga and Andrew McCallum. 2016. Row-less universal schema. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, San Diego, CA, pages 63-68.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41. https://doi.org/10.1093/nar/gkt441

Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Jiao Li, Thomas C. Wiegers, and Zhiyong Lu. 2016. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database 2016.

Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. 2017. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), Valencia, Spain, pages 1183-1194.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations, San Diego, California, USA.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pages 1753-1762.

Huiwei Zhou, Huijie Deng, Long Chen, Yunlong Yang, Chen Jia, and Degen Huang. 2016a. Exploiting syntactic and semantics information for chemical-disease relation extraction. Database 2016.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016b. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pages 207-212.

A Implementation Details

The model is implemented in TensorFlow (Abadi et al., 2015) and trained on a single TitanX GPU. The number of transformer block repeats is B = 2. We optimize the model using Adam (Kingma and Ba, 2015), with the best values for ε, β1, and β2 chosen from the development set. The learning rate is set to 0.0005 and the batch size to 32. In all of our experiments we set the number of attention heads to h = 4.

We clip the gradients to norm 10 and apply noise to the gradients (Neelakantan et al., 2015). We tune the decision threshold for each relation type separately and perform early stopping on the development set. We apply dropout (Srivastava et al., 2014) to the input layer, randomly replacing words with a special UNK token with keep probability 0.85. We additionally apply dropout to the input (word embedding + position embedding), interior layers, and final state. At each step, we randomly sample a positive or negative (NULL class) minibatch with probability 0.5.

A.1 Chemical Disease Relations Dataset

Token embeddings are pre-trained using skipgram (Mikolov et al., 2013) over a random subset of 10% of all PubMed abstracts with window size 10 and 20 negative samples. We merge the train and development sets and randomly take 850 abstracts for training and 150 for early stopping. Our reported results are averaged over 10 runs using different splits. All baselines train on both the train and development sets. Models took between 4 and 8 hours to train.

ε was set to 1e-4, β1 to 0.1, and β2 to 0.9. Gradient noise η = 0.1. Dropout was applied to the word embeddings with keep probability 0.85, internal layers with 0.95, and the final bilinear projection with 0.35 for the standard CDR dataset experiments. When adding the additional weakly labeled data: word embeddings with keep probability 0.95, internal layers with 0.95, and the final bilinear projection with 0.5.

A.2 Chemical Protein Relations Dataset

We construct our byte-pair encoding vocabulary using a budget of 7,500. The dataset contains annotations for a larger set of relation types than are used in evaluation. We train on only the relation types in the evaluation set and set the remaining types to the NULL relation. The embedding dimension is set to 200 and all embeddings are randomly initialized. ε was set to 1e-8, β1 to 0.1, and β2 to 0.9. Gradient noise η = 1.0. Dropout was applied to the word embeddings with keep probability 0.5, internal layers with 1.0, and the final bilinear projection with 0.85 for these experiments.

A.3 Full CTD Dataset

We tune separate decision boundaries for each relation type on the development set. For each prediction, the relation type with the maximum probability is assigned. If the probability is below the relation-specific threshold, the prediction is set to NULL. We use embedding dimension 128 with all embeddings randomly initialized. Our byte pair encoding vocabulary is constructed with a budget of 50,000. Models took 1 to 2 days to train. β1 was set to 0.1, β2 to 0.9, and ε to 1e-4. Gradient noise η = 0.1. Dropout was applied to the word embeddings with keep probability 0.95, internal layers with 0.95, and the final bilinear projection with 0.5.
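A sketch of the per-relation threshold tuning described in A.3 follows, assuming a simple grid search over development-set F1; the grid and the F1 computation are illustrative choices, not the released procedure.

    import numpy as np

    def tune_thresholds(dev_probs, dev_labels, num_relations,
                        grid=np.linspace(0.05, 0.95, 19)):
        # dev_probs: (E, L) predicted probabilities per entity pair and relation
        # dev_labels: (E,) gold relation indices, with -1 for NULL
        # Returns one decision threshold per relation type.
        def f1_at(threshold, rel):
            # assign the argmax relation, then map low-confidence predictions to NULL
            pred = (dev_probs[:, rel] >= threshold) & (dev_probs.argmax(axis=1) == rel)
            gold = dev_labels == rel
            tp = np.sum(pred & gold)
            p = tp / max(pred.sum(), 1)
            r = tp / max(gold.sum(), 1)
            return 2 * p * r / max(p + r, 1e-8)
        return [max(grid, key=lambda t: f1_at(t, rel)) for rel in range(num_relations)]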
