Web Corpus Construction

Author : Roland Schäfer
Publisher : Springer Nature
Page : 129 pages
File Size : 48,5 Mb
Release : 2022-05-31
Category : Computers
ISBN : 9783031021527

The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. There are several advantages of this approach: (i) Working with such corpora obviates the problems encountered when using Internet search engines in quantitative linguistic research (such as non-transparent ranking algorithms). (ii) Creating a corpus from web data is virtually free. (iii) The size of corpora compiled from the WWW may exceed by several orders of magnitude the size of language resources offered elsewhere. (iv) The data is locally available to the user, and it can be linguistically post-processed and queried with the tools preferred by her/him. This book addresses the main practical tasks in the creation of web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanups including boilerplate removal and removal of duplicated content. Linguistic processing and problems with linguistic processing coming from the different kinds of noise in web corpora are also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).

For additional material please visit the companion website: sites.morganclaypool.com/wcc

Table of Contents: Preface / Acknowledgments / Web Corpora / Data Collection / Post-Processing / Linguistic Processing / Corpus Evaluation and Comparison / Bibliography / Authors' Biographies
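To make the "removal of duplicated content" step in the description above more concrete, here is a minimal Python sketch of near-duplicate filtering. It is not taken from the book: the word-shingle and Jaccard-similarity approach, the function names (`shingles`, `jaccard`, `deduplicate`), and the 0.8 threshold are all assumptions chosen for illustration. The pairwise comparison is quadratic, so it only suits small collections.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short texts: use the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, n=3):
    """Keep a document only if it is not too similar to any kept document.

    The threshold is a hypothetical value, not a recommendation.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",
    "A completely different sentence about corpus linguistics",
]
unique_docs = deduplicate(docs)
```

In this example the second document differs from the first only in case and punctuation, so it is dropped. For corpora at the scale the book discusses, an approximate technique would likely be needed instead of exhaustive pairwise comparison.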

Web Corpus Construction

Author : Roland Schäfer, Felix Bildhauer
Publisher : Morgan & Claypool Publishers
Page : 197 pages
File Size : 54,7 Mb
Release : 2013-07-01
Category : Computers
ISBN : 9781627053129

Corpus Linguistics and the Web

Author : Anonymous
Publisher : BRILL
Page : 311 pages
File Size : 54,6 Mb
Release : 2015-07-14
Category : Language Arts & Disciplines
ISBN : 9789401203791

Using the Web as corpus is one of the recent challenges for corpus linguistics. This volume presents a current state-of-the-art discussion of the topic. The articles address practical problems such as suitable linguistic search tools for accessing the www and the question of register variation, or they probe into methods for culling data from the web. The book also offers a wide range of case studies, covering morphology, syntax, lexis, as well as synchronic and diachronic variation in English. These case studies make use of the two approaches to the www in corpus linguistics – web-as-corpus and web-for-corpus-building. The case studies demonstrate that web data can provide useful additional evidence for a broad range of research questions.

Overcoming Challenges in Corpus Construction

Author : Robbie Love
Publisher : Routledge
Page : 176 pages
File Size : 51,5 Mb
Release : 2020-01-06
Category : Language Arts & Disciplines
ISBN : 9780429771095

This volume offers a critical examination of the construction of the Spoken British National Corpus 2014 (Spoken BNC2014) and points the way forward toward a more informed understanding of corpus linguistic methodology more broadly. The book begins by situating the creation of this second corpus, a compilation of new, publicly-accessible Spoken British English from the 2010s, within the context of the first, created in 1994, talking through the need to balance backward compatibility and optimal practice for today’s users. Chapters subsequently use the Spoken BNC2014 as a focal point around which to discuss the various considerations taken into account in corpus construction, including design, data collection, transcription, and annotation. The volume concludes by reflecting on the successes and limitations of the project, as well as the broader utility of the corpus in linguistic research, both in current examples and future possibilities. This exciting new contribution to the literature on linguistic methodology is a valuable resource for students and researchers in corpus linguistics, applied linguistics, and English language teaching.

Web As Corpus

Author : Maristella Gatto
Publisher : A&C Black
Page : 255 pages
File Size : 44,7 Mb
Release : 2014-02-13
Category : Language Arts & Disciplines
ISBN : 9781441134134

Is the internet a suitable linguistic corpus? How can we use it in corpus techniques? What are the special properties that we need to be aware of? This book answers those questions. The Web is an exponentially increasing source of language and corpus linguistics data. From gigantic static information resources to user-generated Web 2.0 content, the breadth and depth of information available is breathtaking – and bewildering. This book explores the theory and practice of the “web as corpus”. It looks at the most common tools and methods used and features a plethora of examples based on the author's own teaching experience. This book also bridges the gap between studies in computational linguistics, which emphasize technical aspects, and studies in corpus linguistics, which focus on the implications for language theory and use.

Developing Linguistic Corpora

Author : Martin Wynne
Publisher : Oxbow Books Limited
Page : 100 pages
File Size : 53,8 Mb
Release : 2005
Category : Language Arts & Disciplines
ISBN : UVA:X004991162

A linguistic corpus is a collection of texts which have been selected and brought together so that language can be studied on the computer. Today, corpus linguistics offers some of the most powerful new procedures for the analysis of language, and the impact of this dynamic and expanding sub-discipline is making itself felt in many areas of language study. In this volume, a selection of leading experts in various key areas of corpus construction offer advice in a readable and largely non-technical style to help the reader to ensure that their corpus is well designed and fit for the intended purpose. This guide is aimed at those who are at some stage of building a linguistic corpus. Little or no knowledge of corpus linguistics or computational procedures is assumed, although it is hoped that more advanced users will find the guidelines here useful. It is also aimed at those who are not building a corpus, but who need to know something about the issues involved in the design of corpora in order to choose between available resources and to help draw conclusions from their studies.

WaCky!

Author : Marco Baroni, Silvia Bernardini
Publisher : Gedit
Page : 238 pages
File Size : 50,6 Mb
Release : 2006
Category : Computers
ISBN : STANFORD:36105121435536

Essential Speech and Language Technology for Dutch

Author : Peter Spyns, Jan Odijk
Publisher : Springer Science & Business Media
Page : 414 pages
File Size : 50,5 Mb
Release : 2013-02-26
Category : Language Arts & Disciplines
ISBN : 9783642309106

The book provides an overview of more than a decade of joint R&D efforts in the Low Countries on HLT for Dutch. It presents not only the state of the art of HLT for Dutch in the areas covered but, even more importantly, a description of the resources (data and tools) for Dutch that have been created and are now available for both academia and industry worldwide. The contributions cover many areas of human language technology (for Dutch): corpus collection (including IPR issues) and building (in particular one corpus aiming at a collection of 500M word tokens), lexicology, anaphora resolution, a semantic network, parsing technology, speech recognition, machine translation, text (summaries) generation, web mining, information extraction, and text-to-speech, to name the most important ones. The book also shows how a medium-sized language community (spanning two territories) can create a digital language infrastructure (resources, tools, etc.) as a basis for subsequent R&D. At the same time, it bundles contributions of almost all the HLT research groups in Flanders and the Netherlands, and hence offers a view of their recent research activities. Targeted readers are mainly researchers in human language technology, in particular those focusing on Dutch. This includes researchers active in larger networks such as CLARIN, META-NET, and FLaReNet, and those participating in conferences such as ACL, EACL, NAACL, COLING, RANLP, CICling, LREC, CLIN and DIR (both in the Low Countries), InterSpeech, ASRU, ICASSP, ISCA, EUSIPCO, CLEF, TREC, etc. In addition, some chapters are of interest to human language technology policy makers and even to science policy makers in general.

Corpus Linguistics

Author : Tony McEnery, Andrew Hardie
Publisher : Cambridge University Press
Page : 128 pages
File Size : 40,9 Mb
Release : 2011-10-06
Category : Language Arts & Disciplines
ISBN : 9781139502443

Corpus linguistics is the study of language data on a large scale - the computer-aided analysis of very extensive collections of transcribed utterances or written texts. This textbook outlines the basic methods of corpus linguistics, explains how the discipline of corpus linguistics developed and surveys the major approaches to the use of corpus data. It uses a broad range of examples to show how corpus data has led to methodological and theoretical innovation in linguistics in general. Clear and detailed explanations lay out the key issues of method and theory in contemporary corpus linguistics. A structured and coherent narrative links the historical development of the field to current topics in 'mainstream' linguistics. Practical tasks and questions for discussion at the end of each chapter encourage students to test their understanding of what they have read and an extensive glossary provides easy access to definitions of technical terms used in the text.

Cyber-Physical Systems for Social Applications

Author : Maya Dimitrova, Hiroaki Wagatsuma
Publisher : IGI Global
Page : 440 pages
File Size : 40,6 Mb
Release : 2019-04-03
Category : Computers
ISBN : 9781522578802

Present-day sophisticated, adaptive, and (to a certain degree) autonomous robotic technology is a radically new stimulus for the cognitive system of the human learner from the earliest to the oldest age. It deserves extensive, thorough, and systematic research based on novel frameworks for analysis, modelling, synthesis, and implementation of CPSs for social applications. Cyber-Physical Systems for Social Applications is a critical scholarly book that examines the latest empirical findings for designing cyber-physical systems for social applications and aims at forwarding the symbolic human-robot perspective in areas that include education, social communication, entertainment, and artistic performance. Highlighting topics such as evolinguistics, human-robot interaction, and neuroinformatics, this book is ideally designed for social network developers, cognitive scientists, education science experts, evolutionary linguists, researchers, and academicians.

Forensic Authorship Analysis and the World Wide Web

Author : S. Larner
Publisher : Springer
Page : 76 pages
File Size : 40,8 Mb
Release : 2014-10-21
Category : Language Arts & Disciplines
ISBN : 9781137413758

Implementing a novel method for identifying idiolectal co-selections, and taking the UNABOM investigation as a case study, this Pivot evaluates the effectiveness and reliability of using the web for forensic purposes.

Building and Exploring Web Corpora (WAC3 - 2007)

Author : Cédrick Fairon
Publisher : Presses univ. de Louvain
Page : 186 pages
File Size : 53,5 Mb
Release : 2007
Category : Language Arts & Disciplines
ISBN : 2874630829

WAC: More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.

CLEANEVAL: Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.
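As a rough illustration of the kind of cleaning the description above refers to (removing HTML markup and navigation bars), the following Python sketch uses the standard library's `html.parser`. It is not the method used in Cleaneval or by any workshop participant; the class name `TextExtractor`, the function `strip_markup`, and the choice of tags to skip or treat as block boundaries are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles, and navigation."""

    # Hypothetical choices: elements whose content is discarded entirely,
    # and elements treated as word boundaries.
    SKIP = {"script", "style", "nav"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_markup(html):
    """Return the text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Web data needs <b>cleaning</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
clean_text = strip_markup(page)
```

Tag-based filtering like this only catches boilerplate that is marked up as such; advertisements and navigation embedded in ordinary `div` elements would pass through, which is part of why the task is harder than it looks.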

Building and Using Comparable Corpora for Multilingual Natural Language Processing

Author : Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum
Publisher : Springer Nature
Page : 138 pages
File Size : 46,9 Mb
Release : 2023-08-23
Category : Computers
ISBN : 9783031313844

This book provides a comprehensive overview of methods to build comparable corpora and of their applications, including machine translation, cross-lingual transfer, and various kinds of multilingual natural language processing. The authors begin with a brief history on the topic followed by a comparison to parallel resources and an explanation of why comparable corpora have become more widely used. In particular, they provide the basis for the multilingual capabilities of pre-trained models, such as BERT or GPT. The book then focuses on building comparable corpora, aligning their sentences to create a database of suitable translations, and using these sentence translations to produce dictionaries and term banks. Then, it is explained how comparable corpora can be used to build machine translation engines and to develop a wide variety of multilingual applications.

AI 2008: Advances in Artificial Intelligence

Author : Wayne Wobcke, Mengjie Zhang
Publisher : Springer Science & Business Media
Page : 631 pages
File Size : 45,8 Mb
Release : 2008-11-13
Category : Computers
ISBN : 9783540893776

This book constitutes the refereed proceedings of the 21st Australasian Joint Conference on Artificial Intelligence, AI 2008, held in Auckland, New Zealand, in December 2008. The 42 revised full papers and 21 revised short papers presented together with 1 invited lecture were carefully reviewed and selected from 143 submissions. The papers are organized in topical sections on knowledge representation, constraints, planning, grammar and language processing, statistical learning, machine learning, data mining, knowledge discovery, soft computing, vision and image processing, and AI applications.

The MIHI EST construction

Author : Mihaela Ilioaia
Publisher : Walter de Gruyter GmbH & Co KG
Page : 413 pages
File Size : 48,8 Mb
Release : 2023-12-18
Category : Foreign Language Study
ISBN : 9783111055626

This book examines the Romanian mihi est construction (Mi-e foame/frică, me.dat = is hunger/fear ‘I am hungry/ afraid’). While it disappeared from all other Romance languages to be replaced with a habeo structure, the mihi est pattern is in Romanian the most common way of expressing psychological or physiological states. By means of synchronic and diachronic corpus studies, the book investigates the status of the core arguments of the mihi est structure, i.e. the dative experiencer and the nominative state noun, as well as its evolution throughout the centuries. The data analysis reveals that the dative experiencer syntactically behaves like nominative subjects, whereas the state noun shows predicate behavior. As for the evolution of the mihi est structure, the analysis shows a certain tendency toward innovation, since in present-day Romanian it can coerce nouns coming from other semantic fields into the construction’s psychological or physiological interpretation. Could this be another unique trait of Romanian, which causes it to seemingly go against the tendency of most Romance languages toward canonical marking of core arguments?